ABSTRACT



An application layer protocol is provided on top of HTTP
1.0/1.1 to allow COM Automation objects to be invoked
over the Internet through IIS/ISAPI servers. The format
encodes the Automation object's name, the method to
invoke, and any [in], [out], or [in, out] parameters that the
method signature requires, packages them into a custom
MIME type, and marshals the result to the ISAPI dynamic
link library (DLL) on the IIS/HTTP server. There, the ISAPI
DLL unpacks and parses the SOAP request, creates the
Automation object, invokes the method with the marshaled
parameters, and then returns any [out] parameters to the
caller/client using the SOAP protocol. It is a stateless
protocol, meaning that object lifetimes extend only to one
method call, and objects are recreated between multiple
calls to the object.
SIMPLE OBJECT ACCESS PROTOCOL

FIELD OF THE INVENTION

The present invention relates to the field of Internet
interactivity and, more particularly, to a system for accessing
and invoking automation objects over the Internet.

BACKGROUND OF THE INVENTION

In the early days of desktop computing, all applications
were monolithic, i.e., they were self-contained, standalone
applications [...] a problem [...] existed with these
monolithic applications. Development of traditional software
applications required the application executables to be
compiled and linked with their dependencies. Thus, every time
developers wanted to update the processing logic or implement
new capabilities, they would have to modify and recompile the
entire primary application in order to do so. In essence, in
order to make any changes to any portion of the program, the
entire application had to be rewritten. This made it
impractical to upgrade the application as minor improvements
were made.

This problem was addressed by the introduction of a
component software paradigm. A basic principle of component
software is that applications can be built from a series of
prebuilt and easily developed, understood, and changed
software modules called components, each providing a
particular function. Thus, applications could be delivered,
enhanced, or extended much more quickly and at a lower cost
simply by updating or adding new components.

Unfortunately, the component software paradigm suffers a
problem similar to that of the monolithic application. Each
time the components are enhanced and upgraded, as with
applications, the components must be recompiled by the
component developers. Either the application developer or the
end user would have to monitor for and obtain updated
components. The distributed component paradigm has provided a
solution to this problem.

Distributed components exist at specific locations.
Developers of applications or other components that require a
distributed component need only find the component and then
use it. The developer does not need to compile or recompile
the component. This is done by the creators of the component.
Thus, the latest and greatest version of each component is
always available to developers and other users.

The widespread use of the Internet, an open environment,
presents many new opportunities for distributed component
software, and some associated shortcomings as well. The
availability of a vast number of vendors, each creating a
number of components, increases the ease with which
applications may be built and increases the flexibility to
tailor an application to suit a user's needs. Unfortunately,
the open environment of the Internet means that no one can
implicitly "trust" everyone else, as is the case with a
traditional client-server system. Thus, all but some dedicated
server machines are hidden behind firewalls to protect against
unwanted intrusions. Firewalls are barriers that filter packets
[...] such as type of packets. [...] firewalls shield servers
by controlling traffic between the Internet and the server and
controlling which packets may pass through them.

Since only certain types of packets may pass through to
the server when firewalls are in place, the ability to access
remote components over the Internet is severely limited.

A second problem with today's Internet is not so much a
problem as a shortcoming. Much of the Internet's use is
conducted through the World Wide Web, hereinafter referred
to as "WWW," or simply the "Web," in which linked pages
of static content composed of a variety of media, such as
text, images, audio, and video, are described using hypertext
markup language (HTML). While the WWW revolution
opened the doors to a wealth of information at the fingertips
of ordinary people, and while HTML is a very good way of
describing static documents, it provides no means to interact
with the Web page. In the static model, a Web browser uses
HTTP to request an HTML file from a Web server. HTTP is an
Internet protocol for rapid and efficient delivery of HTML
documents. HTTP is a stateless protocol, meaning that each
request to the Web server is treated independently, with the
server retaining no "memory" of any previous connections.
The Web server receives the request and sends the HTML
page to the Web browser, which formats and displays the
page. Although this model provides a client with ready
access to nicely formatted pages of information, it provides
only limited interaction between the client and the Web
server. Furthermore, HTML pages must be manually edited
in order to change what the Web server sends to a client,
such as a Web browser. Thus, much of the potential richness
of the World Wide Web is not realized.

One of the biggest challenges to any Web site is to offer
dynamic content, i.e., content that changes in realtime. This
requires applications to be run from the Web servers. Changing
from a static Web content model to a dynamic Web content
model would allow Web sites to provide interactive business
applications, rather than merely publishing pages of static
information. For example, a travel agency could enable
customers to [...] fares, and reserve [...] on line, rather
than [...] at flight schedules.

HTTP [...] dynamic Web [...] interacting with Web pages
potentially [...] client, such as a Web browser, is used to
initiate a query, which is sent to an HTTP server operating on
a host computer somewhere on the Internet. The query might
represent a request for documents containing certain data, or
may represent the address, or Uniform Resource Locator (URL),
of a particular Web page. The server locates the documents and
sends their contents back to the client. In loading the
documents for viewing, the client often encounters additional
files, such as embedded images or sounds, that need to be
loaded. The client continues making requests to the server
until all of the additional files are received and loaded.

Since HTTP is a stateless protocol, as mentioned above,
existing HTTP servers create a separate process for each
request received. The greater the number of concurrent
requests, the greater the number of concurrent processes
created by the server. Unfortunately, creating a process for
every request is time-consuming and requires large amounts
of server resources such as memory and processor cycles. In
addition, creating a process for every request can restrict the
server resources available for sharing, slowing down
performance, and increasing wait times.

In summary, since most servers are protected by firewalls,
only certain types of packets, such as HTTP packets, may
pass to the server. Since HTTP is not suited for
interactivity, the goal of providing dynamic content over the
Internet is severely limited. Thus, in order to fully realize the
potential of distributed component software and of dynamic
content on the World Wide Web, there exists a need for
software having the ability to access and invoke Automation
objects through firewalls.
SUMMARY OF THE INVENTION

In accordance with the present invention, a method and
software program that provides end users and developers
with all the advantages of distributed component software,
and capitalizes on the resources available on a computer
network such as the Internet, to provide a richer, more
interactive content is provided. The invention achieves this
result by defining a protocol capable of accessing and
invoking methods in Automation objects across the Internet
and through firewalls. The protocol, called a Simple Object
Access Protocol (SOAP), is an application layer protocol
that is layered on top of HTTP and allows Microsoft
Component Object Model (COM) Automation objects to be
accessed and methods to be invoked over the Internet
through Web servers protected by firewalls. "Application
layer" refers to the highest layer in the seven-layer
Reference Model for Open Systems Interconnection (OSI
Reference Model), an international standard for networking by
the International Standards Organization (ISO). The application
layer is concerned with the semantics of the information
exchanged; it ensures that two application processes [...]
network understand each other. The OSI Reference Model,
as described in "Open Systems Interconnection (OSI): New
International Standards Architecture and Protocols for
Distributed Information Systems," special issue, Proc.
IEEE, vol. 71, no. 12, December 1983, is hereby incorporated
by reference.

The inventive protocol includes a data structure which
encodes, as a SOAP request, the name of the Automation
object of interest, a method to invoke in that object, and any
valid Automation [in/out] parameters to be exchanged with
the object, and creates a client-side SOAP proxy for the
Automation object. The range of valid parameter types is
defined by the COM Automation Variant type. In addition to
Variant data types, the protocol also supports passing
ActiveX Data Object Recordset objects (ADO). Variant and
Automation "object" classes such as the ADO Recordset
may be used as either [in], [out], or [in, out] parameters. The
[...] MIME [...].

MIME, which stands for Multipurpose Internet Mail
Extensions, is an extension to the traditional Internet Mail
protocol to allow for multimedia electronic mail. MIME was
developed to accommodate electronic mail messages
containing many parts of various types such as text, images,
video, and audio. MIME is defined in Document RFC 1521
of the Network Working Group, September 1993, which is
hereby incorporated by reference.

The SOAP proxy marshals and transfers the multipart
MIME-encoded SOAP request to an Applications Programming
Interface (API) which acts as a server-side SOAP stub
for processing SOAP messages. Marshaling is the process of
packaging up the data so that when it is sent from one
process to another, the receiving process can decipher the
data. The SOAP stub, which is running on the Web server,
unpacks and parses the SOAP request, instantiates the COM
Automation object, and invokes the method with the
marshaled [in] parameters. The SOAP stub also returns any
[out], or [in, out], or return, parameters from the COM
Automation object instance to the SOAP proxy, and the
Automation object instance is reclaimed. Thus, SOAP is a
stateless protocol, i.e., one where object lifetimes only
extend to one method call, and which are recreated for each
call to the object.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing aspects and many of the attendant
advantages of this invention will become more readily
appreciated as the same becomes better understood by reference
to the following detailed description, when taken in conjunction
with the accompanying drawings, wherein:

FIG. 1 depicts an environment in which the present
invention operates;

FIG. 2 is a block diagram of an embodiment of the present
invention;

FIG. 3 is a flowchart depicting a method of marshaling
data across the Internet according to the present invention;

FIG. 4 is a block diagram depicting a data structure for a
request message according to an embodiment of the present
invention; and

FIG. 5 is a block diagram depicting a data structure for a
response message according to an embodiment of the
present invention.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENT

As will be better understood from the following
description, with reference to FIG. 1, the present invention
[...] client computers [...] 10b, [...] to invoke
Automation objects located on remote ISAPI-enabled Web
servers [...] a computer network such as the Internet 20.

The remote servers 30 may take the form of a host
computer 30a, a minicomputer 30b, a mainframe computer
30c, or any other configuration of computer. A typical client
computer 10a for implementing the invention is a general
purpose computing device such as a conventional personal
computer, which comprises such well-known items as a
central processing unit 12, system memory 14, a modem
and/or network card 16 for connecting the local computer to
the Internet 20, a display 18, and other components not
specifically shown in FIG. 1, such as a keyboard, mouse, etc.
While the remote servers 30 will typically be university or
corporate mainframe computers 30c, as noted above, they
may take the form of host personal computers 30a or
dedicated workstations such as minicomputers 30b. Since all
[...] have, in general, the same properties, for simplicity of
illustration and description, the following description will
describe the interaction between a client computer 10 and a
server computer 30. As will be better understood from the
following description, the present invention implements a
protocol called the Simple Object Access Protocol (SOAP) as
computer programs executing on the client computer 10 and on
the server computer 30.

In a present embodiment, Automation objects are
implemented as COM automation objects. COM, the Component
Object Model by Microsoft Corporation of Redmond,
Wash., is an implementation of component software
technology, i.e., the idea of breaking large, complex software
applications into a series of pre-built and easily
developed, understood, and changed software modules
called components. COM is described in Dale Rogerson,
Inside COM, Microsoft Press, 1997, which is hereby
incorporated by reference.

As shown in FIG. 2, running on the client computer 10 is
a client process 110, such as a Web browser. Running on the
server computer 30 is a corresponding process, such as a
Web server 160. [...] the client process 110 is a
script or application 130. The client process 110 has a
run-time environment 100 that, in addition to the Web
browser 110, also includes a SOAP proxy 140 whose nature
is described below. The server computer 30 is protected by
a firewall 150. Running on the server computer behind the
firewall is the Web server 160, a SOAP stub 170 and an
instance of an Automation object 180. The SOAP stub 170
interacts with the SOAP proxy 140 in the manner described below.
Connected to, and in communication with, the server
computer 30 is a database 190, which may be located in the
server computer 30 itself, or located remotely on a database
server (not shown). As described above in connection with
FIG. 1, the client computer 10 and server computer 30 are
connected to, and in communication with, each other through a
network such as the Internet 20.

SOAP is a data transmission paradigm. The data
transmission paradigm includes a three-section data structure
that comprises a header, body, and trailer. The data structure is
used to package information referring to a request to invoke
a method of an Automation object. In operation, when the
client process 110 requires certain data from an Automation
object, the process issues a method call, which causes an
Advanced DataSpace 120 to be created. The Advanced
DataSpace 120, in turn, creates a SOAP proxy 140 for the
Automation object 180. The SOAP proxy 140 packages the
data structure as an HTTP POST message in multipart
MIME packets, and sends the message as a binary data
stream through the network, i.e., the Internet 20, to the
server computer 30 where the Automation object 180 is
located.

When the server computer 30 receives the HTTP POST
message, the server process 160, i.e., the Web server 160,
invokes a SOAP stub 170 for the SOAP proxy 140. The
SOAP stub that is invoked is chosen based on an identifier
contained in the header of the data structure. The SOAP stub
170 unpackages the multipart MIME packets and instantiates
the Automation object 180 identified in the header of the
data structure. A method name field also identified in the
header of the data structure indicates the method of the
Automation object 180 to invoke. The method is invoked by
the SOAP stub 170 using [in] parameters contained in the
body of the data structure.

After the method has finished executing, return, or [out],
parameters are returned to the SOAP stub 170, which
packages the [out] parameters as multipart MIME packets
and transmits a resulting HTTP Response message as a
binary data stream across the Internet 20 to the SOAP proxy
140. The SOAP proxy 140 unpackages the multipart MIME
packets and returns the [out] and [in, out] parameters to the
client process 110. The instance of the Automation object
180 is reclaimed after the [out] parameters are returned to
the SOAP stub 170. The [out] parameters, like the [in]
parameters, are contained in the body of the data structure.

As noted above and illustrated in FIG. 2, the client process
110 contains a script or application 130 for performing a
particular function. The script or application may be
implemented by such means as components developed according
to the ActiveX specification by Microsoft Corporation of
Redmond, Wash., or as an embedded script written in a
language such as JScript by Microsoft Corporation of
Redmond, Wash. Those skilled in the art will readily
recognize alternative methods and means for implementing the
script or application, and will appreciate that they may be
employed without departing from the spirit or scope of the
present invention. The client process 110 may be a Web
browser, an example of which is the Internet Explorer, from
Microsoft Corporation, of Redmond, Wash., or may be any
distributed component software application requiring one or
more components from one or more vendor sites located on
the Internet.

FIG. 3 is a functional flow diagram illustrating in more
detail the operation of the invention. FIGS. 4 and 5 depict
the data structure in more detail. FIG. 4 depicts the data
structure from the client computer 10 toward the server
computer 30. FIG. 5 depicts the data structure from the
server computer 30 toward the client computer 10.

Referring to FIG. 3, blocks 510 to 530, and 570 to 580
represent actions performed on the client computer. Blocks
535 to 565 represent actions performed on the server
computer. During the running of the client process (block 510),
when the script or application 130 makes a method call
(block 515), Advanced DataSpace 120 is created. See block
520. As shown in FIG. 2, this occurs within the run-time
environment 100 of the client process 110. The Advanced
DataSpace 120 is a client-side automation object, the sole
purpose of which is to create a SOAP proxy 140 (block 525)
with which the client process 110 interacts. When the SOAP
proxy is created, the Advanced DataSpace tells the SOAP
proxy the name of the server computer 30 that it is targeting
as well as the name, or progid, of the Automation object 180
that is to be instantiated on the server.

After the SOAP proxy 140 is created, a method call from
the client process 110 is made on the SOAP proxy 140,
which converts the method call into an HTTP POST
message. See block 52[...]. The HTTP POST message is shown in
FIG. 4. The HTTP POST message has the previously
described three-section data structure, i.e., the HTTP POST
message comprises a header 310, a body 320 and a trailer
330. The header 310 includes fields for holding data
representing a "POST" instruction 312, the name of an API for
processing the message 314, an indicator of the version of
HTTP being used 316, the progid 322 of the Automation
object to be instantiated, and a method name 324 that
identifies the object method to be invoked.

In an exemplary embodiment of the present invention, the
HTTP version information is used by the client computer 10
to indicate to the server computer 30 the highest permissible
version of HTTP that can be used to format response
messages produced by the server computer 30. While HTTP
versions 1.0 and 1.1 are presently contemplated for use in
actual embodiments of the invention, those skilled in the art
will readily appreciate that any other versions of HTTP, as
well as various versions of other Internet protocols, may be
used without departing from the spirit and scope of the
invention.

The header 310 of the HTTP POST message also includes
fields for holding data representing the PROGRAM ID
(progid) 322 of the COM Automation object to be
instantiated, a method name 324 that identifies the object
method to be invoked following the progid 322, and any [in]
parameters 326 that are needed by the method. The progid
is an alphanumeric representation of the unique GUID used
to identify the Automation class to instantiate on the target
machine. The progid is used to obviate the need for
application developers to encode long (128 bit) numeric
sequences to identify Automation objects. This concept of
progids is a part of the COM Automation model defined by
Microsoft Corporation. The body of the HTTP POST
message includes a MIME-encoding of the [in] parameters to be
passed to the instantiated Automation object. Trailer 330 of
the HTTP POST message preferably includes a field for
holding data representing a checksum 332 for
error-checking and correction purposes.

Returning to FIG. 3, the SOAP proxy 140 takes the HTTP
POST message, and packages it as multipart MIME packets
(block 527), which are sent, as a binary data stream, across






US 6,457,066 B1



the Internet 20, through a firewall 150, to the Web server 
160. See block 530. Since the method call is encoded in
HTTP, it passes through the firewall 150 without difficulty. 
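The three-section HTTP POST message of FIG. 4 can be sketched as follows. This is only an illustration under stated assumptions: the text names the fields (POST instruction, API name, HTTP version, progid, method name, [in] parameters, checksum) but not their wire format, so the header names `X-Progid` and `X-Method`, the boundary string, the API path `soapstub.dll`, the progid `Travel.FareQuoter`, and the use of CRC-32 as the checksum are all hypothetical.

```python
import zlib

BOUNDARY = "soap-boundary"  # hypothetical MIME boundary string


def build_request(api: str, progid: str, method: str, in_params: dict) -> bytes:
    """Assemble header, multipart body, and checksum trailer into one byte stream."""
    # Body: one MIME part per [in] parameter.
    parts = []
    for name, value in in_params.items():
        parts.append(
            f"--{BOUNDARY}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
            f"{value}\r\n"
        )
    body = ("".join(parts) + f"--{BOUNDARY}--\r\n").encode("utf-8")

    # Header: POST instruction, API name, HTTP version, progid, method name.
    # X-Progid and X-Method are made-up field names for this sketch.
    header = (
        f"POST /{api} HTTP/1.1\r\n"
        f"Content-Type: multipart/mixed; boundary={BOUNDARY}\r\n"
        f"X-Progid: {progid}\r\n"
        f"X-Method: {method}\r\n"
        "\r\n"
    ).encode("utf-8")

    # Trailer: checksum over header and body (CRC-32 is an assumption).
    trailer = f"{zlib.crc32(header + body):08x}\r\n".encode("ascii")
    return header + body + trailer


request = build_request(
    "soapstub.dll", "Travel.FareQuoter", "GetFare", {"origin": "SEA", "dest": "JFK"}
)
print(request.decode("utf-8"))
```

Because the whole call travels as an ordinary HTTP POST, a firewall that admits HTTP traffic has no reason to treat it differently from any other Web request.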

An example of a suitable Web server 160 is the Internet 
Information Server (IIS), from Microsoft Corporation, of 
Redmond, Wash. A suitable API is the Internet Services 
Applications Programming Interface (ISAPI) Dynamic Link
Library (DLL), an example of which is the Advanced Data
ISAPI (ADISAPI) component, from Microsoft Corporation,
of Redmond, Wash. The API forms the server side stub for 
the SOAP proxy 140, i.e., the API acts as the SOAP stub 
170. Thus, the SOAP stub 170 is a server-side Applications 
Programming Interface (API) that interacts with the SOAP 
proxy 140. Those skilled in the art will recognize that the 
Web server 160 and SOAP stub 170 may be implemented 
using programs other than IIS and ADISAPI, respectively, 
without departing from the spirit and scope of the invention. 

Returning to FIG. 3, upon receiving the multipart MIME 
encoded HTTP POST message, a test is made to determine 
if the SOAP stub 170 is running on the Web server. See 
block 535. If the SOAP stub is not running, the Web server 
160 invokes the API named in the multipart MIME encoded
HTTP POST message to act as the SOAP stub 170. See 
block 540. The SOAP stub 170 implements an HTTP parser 
that unpackages the multipart MIME packets into individual 
parameters for the method call. See block 545. 
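The unpackaging step of block 545 can be sketched with the MIME parser in Python's standard `email` package. This is a minimal sketch, not the ADISAPI implementation: it assumes each parameter travels as one MIME part whose `Content-Disposition` header carries the parameter name, which the text does not specify, and the sample payload is made up.

```python
from email import message_from_bytes


def unpack_parameters(mime_bytes: bytes) -> dict:
    """Split a multipart MIME payload into {parameter name: value}."""
    message = message_from_bytes(mime_bytes)
    if not message.is_multipart():
        raise ValueError("expected a multipart MIME payload")
    params = {}
    for part in message.get_payload():
        # Assumed convention: the parameter name rides in Content-Disposition.
        name = part.get_param("name", header="content-disposition")
        params[name] = part.get_payload(decode=True).decode("utf-8").strip()
    return params


# Hypothetical payload carrying two [in] parameters.
payload = (
    b'Content-Type: multipart/mixed; boundary="soap-boundary"\r\n'
    b"\r\n"
    b"--soap-boundary\r\n"
    b'Content-Disposition: form-data; name="origin"\r\n'
    b"\r\n"
    b"SEA\r\n"
    b"--soap-boundary\r\n"
    b'Content-Disposition: form-data; name="dest"\r\n'
    b"\r\n"
    b"JFK\r\n"
    b"--soap-boundary--\r\n"
)
params = unpack_parameters(payload)
print(params)
```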

At block 550, the Automation object 180 is instantiated by 
the SOAP stub 170, and the method call is made on the 
instantiated Automation object 180. For simplicity, any 
discussion of "Automation object" refers to the instance of 
the COM Automation object executing on the Web server 
160, rather than the program code merely stored on a disk or 
in memory. 

Next, at block 555, the Automation object 180 invokes the 
called method using the [in] parameters provided by the 
SOAP proxy 140, and returns the results, or [out] 
parameters, to the SOAP stub 170. See block 560. The 
SOAP stub 170 repackages the data as an HTTP response 
message in multipart MIME packets. See block 562. 
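Blocks 550 to 560 amount to a dispatch by name: look up the class for the progid, instantiate it, call the named method with the [in] parameters, and hand the [out] values back to the stub. The sketch below illustrates that flow in plain Python. The registry dictionary stands in for COM's progid lookup, and the `FareQuoter` class, its method, and the numeric status values are hypothetical.

```python
class FareQuoter:
    """Hypothetical stand-in for a server-side Automation object."""

    def get_fare(self, origin, dest):
        fares = {("SEA", "JFK"): 329.0}
        return {"fare": fares.get((origin, dest), 0.0)}


# Hypothetical registry playing the role of progid lookup.
REGISTRY = {"Travel.FareQuoter": FareQuoter}


def handle_call(progid, method, in_params):
    """Instantiate the named object, invoke the method, return [out] values."""
    cls = REGISTRY.get(progid)
    if cls is None:
        return {"status": 404, "out": {}}
    instance = cls()  # a fresh instance for this call only
    target = getattr(instance, method, None)
    if not callable(target):
        return {"status": 404, "out": {}}
    out_params = target(**in_params)
    return {"status": 200, "out": out_params}


result = handle_call("Travel.FareQuoter", "get_fare", {"origin": "SEA", "dest": "JFK"})
print(result)
```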

Those skilled in the art will appreciate that the Automa- 
tion object 180 and associated method may be used for 
performing a variety of functions. For instance, the method 
could be used to access and retrieve data from a database 
190 connected to the server computer, or the method could 
be used to insert data into the database 190, or both, by first 
retrieving data records from the database 190, updating the 
data records and then replacing the old database records with 
updated data records. 

The HTTP Response message is shown in FIG. 5. As 
previously described, the HTTP Response message is a three 
part structure that comprises a header 410, a body 420, and 
a trailer 430. The header 410 includes fields for holding data 
representing the version of HTTP being used 412, a status 
code 414, the progid 422 of the Automation object that was 
accessed, and the method name 424 of the object method 
that was invoked. The body 420 includes fields for holding 
data representing the [in] parameters 426 used by the 
method, and any [out], or return, parameters 428 returned by 
the method, i.e., the body includes the parameters marshaled 
across within the MIME stream — the "values" or state of the 
elements to be processed by the target Automation object. 
The trailer 430 of the HTTP Response message preferably 
includes a field for holding data representing a checksum 
432 that is used by the client computer for error-checking 
and correction purposes. 
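How a receiver might use the trailer checksum can be sketched as follows. The text does not name a checksum algorithm, so CRC-32 from Python's `zlib` module and the 4-byte big-endian trailer are assumptions made for illustration. CRC-32 only detects errors; any correction would need a different code or a retransmission.

```python
import zlib


def seal(header: bytes, body: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer computed over header and body."""
    checksum = zlib.crc32(header + body)
    return header + body + checksum.to_bytes(4, "big")


def verify(message: bytes) -> bool:
    """Recompute the CRC-32 and compare it with the trailer."""
    content, trailer = message[:-4], message[-4:]
    return zlib.crc32(content) == int.from_bytes(trailer, "big")


# Hypothetical response: status line as header, one [out] value as body.
response = seal(b"HTTP/1.1 200 OK\r\n\r\n", b"fare=329.0")
# Simulate a transmission error by changing the last body byte.
corrupted = response[:-5] + b"1" + response[-4:]

print(verify(response), verify(corrupted))
```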

Returning to FIG. 3, the web server 160 transmits the 
multipart MIME packets back to the client computer 10 via 
the Internet 20. See block 565. The SOAP proxy 140
unpackages the MIME packets. See block 570. At block 575,
the SOAP proxy returns the [out] parameters to the client
process 110, which performs further processing, as required.
Thereafter, at block 580, the instance of the Automation
object 180 is reclaimed; however, as will be recognized by
those skilled in the art, reclamation may occur at any time
after the method has completed execution.
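The consequence of reclaiming the object after every call is that no state survives between calls. A minimal sketch, using a hypothetical `VisitCounter` class in place of a real Automation object:

```python
class VisitCounter:
    """Hypothetical object that would accumulate state if it were kept alive."""

    def __init__(self):
        self.count = 0

    def visit(self):
        self.count += 1
        return self.count


def stateless_call(factory, method, *args):
    """Create the object, invoke one method, then let the instance go."""
    instance = factory()
    try:
        return getattr(instance, method)(*args)
    finally:
        del instance  # drop the only reference so the object can be reclaimed


first = stateless_call(VisitCounter, "visit")
second = stateless_call(VisitCounter, "visit")
print(first, second)
```

Both calls see a counter that starts from zero, so any state a caller wants to keep has to be passed back in as [in] parameters or stored elsewhere, for example in the database 190.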

As will be readily appreciated by those skilled in the art, 
the present invention solves problems associated with passing
distributed component software through a firewall. More
specifically, the invention provides a way of allowing richer, 
interactive Web content to pass through firewalls. This is 
accomplished by an application layer protocol that allows 
remote Automation objects to be accessed using existing 
protocols which can pass through firewalls. 

While the preferred embodiment of the invention has been 
illustrated and described, it will be appreciated that within 
the scope of appended claims various changes can be made 
therein without departing from the spirit of the invention. 

The embodiments of the invention in which an exclusive 
property or privilege is claimed are defined as follows: 

1. A computer-readable medium having stored thereon a
data structure for marshaling data between a client computer 
and a server computer through the Internet, wherein the 
marshaled data is associated with a method of an Automa- 
tion object to be instantiated on the server computer, the data 
structure comprising: 

(a) a header section comprising: 

(1) a first header data field containing data representing 
a name (progid) of the Automation object to be 
instantiated on the server computer; 

(2) a second header data field containing data repre- 
senting a name of a method of the Automation object 
to invoke; 

(b) a body section comprising: 

(1) at least one first body data field containing data 
representing input parameters for the method to 
process; and

(c) a trailer section comprising a checksum value; wherein 

(d) the data structure is packaged as an HTTP message in 
multipart MIME packets. 

2. The data structure of claim 1, wherein: 

(a) the data structure is for marshaling data from the client 
computer to the server computer, and is packaged as an 
HTTP POST message; and wherein 

(b) the header comprises a third header data field con- 
taining data representing an HTTP POST request, and 
a fourth header data field containing data representing 
a name of an Applications Programming Interface 
(API) for unpackaging multipart MIME packets. 

3. The data structure of claim 2, wherein the header 
further comprises a fifth header data field containing data 
representing an identifier of an HTTP version. 

4. The data structure of claim 1, wherein: 

(a) the data structure is for marshaling data from the 
server computer to the client computer, and is packaged
as an HTTP Response message; 

(b) the header section comprises a third header data field 
containing data representing an identifier of an HTTP 
version, and a fourth header data field containing data 
representing a status code; and wherein 

(c) the body section further comprises at least one second 
body data field containing data representing return 
parameters from the method.

TCP/IP OFFLOAD NETWORK INTERFACE DEVICE

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. §119(e)
of provisional patent application Ser. No. 60/098,296, filed
Aug. 27, 1998, incorporated by reference herein. This
application also claims the benefit under 35 U.S.C. §119(e)
of provisional patent application Ser. No. 60/061,809, filed
Oct. 14, 1997. This application also claims the benefit under
35 U.S.C. §120 of U.S. patent application Ser. No.
09/067,544, filed Apr. 27, 1998, now U.S. Pat. No. 6,226,680,
and U.S. patent application Ser. No. 09/141,713, filed Aug.
28, 1998, both of which are incorporated by reference herein.

CROSS REFERENCE TO COMPACT APPENDIX

The Compact Disc, which is a part of the present
disclosure, includes a recordable Compact Disc (CD-R)
containing information (including CD Appendices A, B, C
and D) that is part of the disclosure of the present patent
document. A portion of the disclosure of this patent
document contains material that is subject to copyright
protection. All the material on the Compact Disc is hereby
expressly incorporated by reference into the present
application. The copyright owner of that material has no
objection to the facsimile reproduction by anyone of the
patent document or the patent disclosure, as it appears in the
Patent and Trademark Office patent files or records, but
otherwise reserves all copyright rights.

BACKGROUND OF THE INVENTION

Network processing as it exists today is a costly and
inefficient use of system resources. A 200 MHz Pentium-Pro
is typically consumed simply processing network data from
a 100 Mb/second network connection. The reasons that this
processing is so costly are described in the next few pages.

When a network packet arrives at a typical network
interface card (NIC), the NIC moves the data into
pre-allocated network buffers in system main memory. From
there the data is read into the CPU cache so that it can be
checksummed (assuming of course that the protocol in use
requires checksums. Some, like IPX, do not.). Once the data
has been fully processed by the protocol stack, it can then be
moved into its final destination in memory. Since the CPU is
moving the data, and must read the destination cache line in
before it can fill it and write it back out, this involves at a
minimum 2 more trips across the system memory bus. In
short, the best one can hope for is that the data will get
moved across the system memory bus 4 times before it
arrives in its final destination. It can, and does, get worse. If
the data happens to get invalidated from system cache after
it has been checksummed, then it must get pulled back
across the memory bus before it can be moved to its final
destination. Finally, on some systems, including Windows
NT 4.0, the data gets copied yet another time while being
moved up the protocol stack. In NT 4.0, this occurs between
the miniport driver interface and the protocol driver
interface. This can add up to a whopping 8 trips across the
system memory bus (the 4 trips described above plus the
move to replenish the cache plus 3 more to copy from the
miniport to the protocol driver). That's enough to bring even
today's advanced memory busses to their knees.

In all but the original move from the NIC to system
memory, the system CPU is responsible for moving the data.
This is particularly expensive because while the CPU is
moving this data it can do nothing else. While moving the
data the CPU is typically stalled waiting for the relatively
slow memory to satisfy its read and write requests. A CPU,
which can execute an instruction every 5 nanoseconds, must
now wait as long as several hundred nanoseconds for the
memory controller to respond before it can begin its next
instruction. Even today's advanced pipelining technology
doesn't help in these situations because that relies on the
CPU being able to do useful work while it waits for the
memory controller to respond. If the only thing the CPU has
to look forward to for the next several hundred instructions
is more data moves, then the CPU ultimately gets reduced to
the speed of the memory controller.

Moving all this data with the CPU slows the system down
even after the data has been moved. Since both the source
and destination cache lines must be pulled into the CPU
cache when the data is moved, more than 3 k of instructions
and/or data resident in the CPU cache must be flushed or
invalidated for every 1500 byte frame. This is of course
assuming a combined instruction and data second level
cache, as is the case with the Pentium processors. After the
data has been moved, the former resident of the cache will
likely need to be pulled back in, stalling the CPU even when
we are not performing network processing. Ideally a system
would never have to bring network frames into the CPU
cache, instead reserving that precious commodity for
instructions and data that are referenced repeatedly and
frequently.

But the data movement is not the only drain on the CPU.
There is also a fair amount of processing that must be done
by the protocol stack software. The most obvious expense is
calculating the checksum for each TCP segment (or UDP
datagram). Beyond this, however, there is other processing
to be done as well. The TCP connection object must be
located when a given TCP segment arrives, IP header
checksums must be calculated, there are buffer and memory
management issues, and finally there is also the significant
expense of interrupt processing, discussed below.

A 64 k server message block (SMB) request (write or
read-reply) is typically made up of 44 TCP segments when
running over Ethernet, which has a 1500 byte maximum
transmission unit (MTU). Each of these segments may result
in an interrupt to the CPU. Furthermore, since TCP must
acknowledge (ACK) all of this incoming data, it's possible
to get another 44 transmit-complete interrupts as a result of
sending out the TCP acknowledgements. While this is
possible, it is not terribly likely. Delayed ACK timers allow
us to acknowledge more than one segment at a time. And
delays in interrupt processing may mean that we are able to
process more than one incoming network frame per
interrupt. Nevertheless, even if we assume 4 incoming
frames per interrupt, and an acknowledgement for every 2
segments (as is typical per the ACK-every-other-segment
property of TCP), we are still left with 33 interrupts per 64 k
SMB request.

Interrupts tend to be very costly to the system. Often when
a system is interrupted, important information must be
flushed or invalidated from the system cache so that the
interrupt routine instructions and needed data can be pulled
into the cache. Since the CPU will return to its prior location
after the interrupt, it is likely that the information flushed
from the cache will immediately need to be pulled back into
the cache. What's more, interrupts force a pipeline flush in
today's advanced processors. While the processor pipeline is
an extremely efficient way of improving CPU performance,
it is expensive to get going after it has been flushed.
Finally, each of these interrupts results in expensive register
accesses across the peripheral bus (PCI).
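
As a rough check on the interrupt estimate given in this background, the arithmetic can be sketched in a few lines of Python. The function name and parameter names are illustrative only and do not come from the disclosure; the figures used (44 segments, 4 frames per receive interrupt, one acknowledgement per 2 segments) are the ones assumed in the text.

```python
import math


def smb_interrupts(segments=44, frames_per_interrupt=4, segments_per_ack=2):
    """Estimate interrupts for one 64 k SMB request on a conventional NIC.

    Returns (receive_interrupts, transmit_complete_interrupts, total).
    All names here are hypothetical; only the default figures are from the text.
    """
    # One receive interrupt covers several incoming frames.
    receive = math.ceil(segments / frames_per_interrupt)
    # Each ACK that is sent may produce one transmit-complete interrupt.
    transmit = math.ceil(segments / segments_per_ack)
    return receive, transmit, receive + transmit


if __name__ == "__main__":
    print(smb_interrupts())          # figures assumed in the text
    print(smb_interrupts(44, 1, 1))  # one interrupt per frame and per ACK
```

With the defaults this gives 11 receive interrupts plus 22 transmit-complete interrupts, matching the 33 quoted above; the second call corresponds to the case described as possible but not likely, 44 of each.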
We noted earlier that when the CPU has to access system
memory, it may be stalled for several hundred nanoseconds.
When it has to read from PCI, it may be stalled for many
microseconds. This happens every time the CPU takes an
interrupt from a standard NIC. The first thing the CPU must
do when it receives one of these interrupts is to read the NIC
Interrupt Status Register (ISR) from PCI to determine the
cause of the interrupt. The most troubling thing about this is
that since interrupt lines are shared on PC-based systems, we
may have to perform this expensive PCI read even when the
interrupt is not meant for us.

Other peripheral bus inefficiencies also exist. Typical
NICs operate using descriptor rings. When a frame arrives,
the NIC reads a receive descriptor from system memory to
determine where to place the data. Once the data has been
moved to main memory, the descriptor is then written back
out to system memory with status about the received frame.
Transmit operates in a similar fashion. The CPU must notify
the NIC that it has a new transmit. The NIC will read the
descriptor to locate the data, read the data itself, and then
write the descriptor back with status about the send.
Typically on transmits the NIC will then read the next
expected descriptor to see if any more data needs to be sent.
In short, each receive or transmit frame results in 3 or 4
separate PCI reads or writes, not counting the status register
read.

SUMMARY OF THE INVENTION

The present invention offloads network processing tasks
from a CPU to a cost-effective intelligent network interface
card (INIC). An advantage of this approach is that a vast
majority of network message data is moved directly from the
INIC into its final destination. Another advantage of this
approach is that the data may be moved in a single trip across
the system memory bus. The offloading allows the CPU to
avoid header processing, data copying, and checksumming.
Since network message data does not need to be placed in a
CPU cache, the CPU cache may be free for storage of other
information. Interrupts may be reduced to four interrupts per
64 k SMB read and two interrupts per 64 k SMB write. Other
advantages include a reduction of CPU reads over the PCI
bus and fewer PCI operations per receive or transmit
transaction.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of fast-path and slow-path modes of
communication processing.
FIG. 2 is a diagram of different buffers employed for the
fast-path and slow-path modes of processing received
messages.
FIG. 3 is a diagram of buffers employed for the fast-path
and slow-path modes of transmitting messages.
FIG. 4 shows an initial format of an interrupt status
register (ISR) of the present invention.
FIG. 5 shows mapping of network packets according to
the present invention with mbufs and buffer descriptors.
FIG. 6 shows some control information structures used to
represent network addresses and protocols according to the
present invention.
FIG. 7 shows a host interface structure combining plural
protocol stacks and drivers for working with an add-on
INIC.
FIG. 8A shows a received TCP packet after processing by
the INIC.
FIG. 8B shows a received ARP frame after processing by
the INIC.
FIG. 9A shows a received data packet for a TCP fast-path
connection.
FIG. 9B shows a received data packet for a TCP slow-path
connection.
FIG. 9C shows a received ARP frame.
FIG. 10 shows sending a fast-path data packet.
FIG. 11 shows sending a slow-path data packet.
FIG. 12 shows sending a non-data command to the INIC.
FIG. 13 is a diagram of the INIC connected to the INIC
miniport driver over the PCI bus.
FIG. 14 is a diagram of an INIC driver connected to plural
INIC cards each having plural network connections.
FIG. 15 shows sending a packet containing an ATCP
command buffer.
FIG. 16 shows mapping the command buffer of FIG. 15
and giving the address of that buffer to the INIC.
FIG. 17 shows an example of a receive header and data
buffer that have been created by the INIC.
FIG. 18 shows the mapping of header buffer and data
buffer descriptors for a received packet.
FIG. 19 is a state diagram summary of a receive finite
state machine showing the main events and transitions.
FIG. 20 is a state diagram summary of a transmit finite
state machine showing the main events and transitions.
FIG. 21 is a diagram of the INIC hardware.
FIG. 22 is a diagram of a communications microprocessor
included in the INIC, showing functions associated with a
plurality of instruction phases.
FIG. 23A is a diagram of a first phase of the
microprocessor of FIG. 22, including the first register set
and related controls.
FIG. 23B is a diagram of a second microprocessor phase,
including reading addresses and data out of a RAM file
register.
FIG. 23C is a diagram of a third microprocessor phase,
including ALU and queue operations.
FIG. 24 is a diagram of various sequencers contained in
the INIC.
FIG. 25 is a diagram of data movement for a Pci slave
write to Dram.
FIG. 26 is a diagram of an SRAM Control Sequencer
contained in the INIC.
FIG. 27 is a timing diagram for the SRAM Control
Sequencer.
FIG. 28 is a block diagram of an External Memory
Control.
FIG. 29 is a timing diagram illustrating a data read from
sdram.
FIG. 30 is a block diagram of an External Memory Read
Sequencer.
FIG. 31 is a timing diagram illustrating a data write to
sdram.
FIG. 32 is a diagram of an External Memory Write
Sequencer.
FIG. 33 is a diagram of a PCI Master-Out Sequencer.
FIG. 34 is a diagram of a PCI Master-In Sequencer.
FIG. 35 is a diagram illustrating data movement from
dram to Pci target.
FIG. 36 is a diagram of a Dram to PCI Sequencer.
FIG. 37 is a diagram illustrating data movement from a
PCI target to dram.
FIG. 38 is a diagram of a PCI to Dram Sequencer.
FIG. 39 is a diagram illustrating data movement from
Sram to Pci target.
FIG. 40 is a diagram of a Sram to PCI Sequencer.
FIG. 41 is a diagram illustrating data movement from a
Pci target to dram.
FIG. 42 is a diagram of a PCI to Sram Sequencer.
FIG. 43 is a diagram illustrating data movement from
dram to Sram.
FIG. 44 is a diagram of a Dram to Sram Sequencer.
FIG. 45 is a diagram illustrating data movement from
Sram to dram.
FIG. 46 is a diagram of a Sram to Dram Sequencer.
FIG. 47 is a diagram of a sequence of events when a PCI
Slave Input Sequencer is the target of a Pci write operation.
FIG. 48 is a diagram of a sequence of events when a PCI
Slave Output Sequencer is the target of a Pci read operation.
FIG. 49 is a diagram of a sequence of events for reception
of a packet.
FIG. 50 is a diagram of a Frame Receive Sequencer.
FIG. 51 is a diagram of a sequence of events for
transmission of a packet.
FIG. 52 is a diagram of a Frame Transmit Sequencer.
FIG. 53 is a timing diagram for a Queue Manager.
FIG. 54 is a diagram of the Queue Manager.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

In order to keep the system CPU from having to process
the packet headers or checksum the packet, this task is
performed on the INIC, which presents a challenge. There
are more than 20,000 lines of C code that make up the
FreeBSD TCP/IP protocol stack, for example. This is more
code than could be efficiently handled by a competitively
priced network card. Further, as noted above, the TCP/IP
protocol stack is complicated enough to consume a 200
MHz Pentium-Pro. In order to perform this function on an
inexpensive card, special network processing hardware has
been developed instead of simply using a general purpose
CPU.

In order to operate this specialized network processing
hardware in conjunction with the CPU, we create and
maintain what is termed a context. The context keeps track
of information that spans many, possibly discontiguous,
pieces of information. When processing TCP/IP data, there
are actually two contexts that must be maintained. The first
context is required to reassemble IP fragments. It holds
information about the status of the IP reassembly as well as
any checksum information being calculated across the IP
datagram (UDP or TCP). This context is identified by the
IP_ID of the datagram as well as the source and destination
IP addresses. The second context is required to handle the
sliding window protocol of TCP. It holds information about
which segments have been sent or received, and which
segments have been acknowledged, and is identified by the
IP source and destination addresses and TCP source and
destination ports.

If we were to choose to handle both contexts in hardware,
we would have to potentially keep track of many pieces of
information. One such example is a case in which a single
64 k SMB write is broken down into 44 1500 byte TCP
segments, which are in turn broken down into 131 576 byte
IP fragments, all of which can come in any order (though the
maximum window size is likely to restrict the number of
outstanding segments considerably).

Fortunately, TCP performs a Maximum Segment Size
negotiation at connection establishment time, which should
prevent IP fragmentation in nearly all TCP connections. The
only time that we should end up with fragmented TCP
connections is when there is a router in the middle of a
connection which must fragment the segments to support a
smaller MTU. The only networks that use a smaller MTU
than Ethernet are serial line interfaces such as SLIP and PPP.
At the moment, the fastest of these connections only run at
128 k (ISDN) so even if we had 256 of these connections,
we would still only need to support 34 Mb/sec, or a little
over three 10bT connections worth of data. This is not
enough to justify any performance enhancements that the
INIC offers. If this becomes an issue at some point, we may
decide to implement the MTU discovery algorithm, which
should prevent TCP fragmentation on all connections
(unless an ICMP redirect changes the connection route while
the connection is established). With this in mind, it seems a
worthy sacrifice to not attempt to handle fragmented TCP
segments on the INIC.

SPX follows a similar framework as TCP, and so the
expansion of the INIC to handle IPX/SPX messages is
straightforward. UDP, on the other hand, does not support
the notion of a Maximum Segment Size, so it is the
responsibility of IP to break down a UDP datagram into
MTU sized packets. Thus, fragmented UDP datagrams are
very common. The most common UDP application running
today is NFSV2 over UDP. While this is also the most
common version of NFS running today, the current version
of Solaris being sold by Sun Microsystems runs NFSV3
over TCP by default. A first embodiment described in detail
in this document offers network processing assistance to
non-fragmented TCP connections on the INIC, while
extension of this design to process other message protocols,
such as SPX/IPX, is straightforward.

As noted above, fragmented TCP segments are not fully
processed by the initial INIC configuration. We have also
opted to not have the INIC handle TCP connection setup and
breakdown. Other TCP "exceptions" which we have elected
to not handle on the INIC include: 1) Retransmission
Timeout: occurs when we do not get an acknowledgement
for previously sent data within the expected time period; 2)
Out of order segments: occurs when we receive a segment
with a sequence number other than the next expected
sequence number; 3) FIN segment: signals the close of the
connection.

Since we have now eliminated support for so many
different code paths, it might seem hardly worth the trouble
to provide any assistance by the INIC at all. This is not the
case. According to W. Richard Stevens and Gary Wright in
Volume 2 of their book "TCP/IP Illustrated", which along
with Volume 1 is incorporated by reference herein, TCP
operates without experiencing any exceptions between 97
and 100 percent of the time in local area networks. As
network, router, and switch reliability improve this number
is likely to only improve with time.

As shown in FIG. 1, different modes of operation are
employed depending upon whether a given network packet
fits our criteria for processing by an INIC 50 or a host 52.
The INIC 50 has a physical layer 55 connected by a PCI bus
57 to a physical layer 59 of the host 52. The INIC 50 has
media access (MAC) 63, IP 64, TCP 65 and netbios 66
hardware processing layers, while the host 52 has media
access (MAC) 73, IP 74, TCP 75, and TDI 76 hardware
processing layers, which operate on behalf of a client 77. In
a first mode, termed fast-path 80, network frames are
processed on the INIC 50 through TCP. In a second mode,
termed slow-path 82, the network frames are processed
through the card and the card operates like a conventional
NIC. In the slow-path case, network frames are handed to
the system at the MAC layer and passed up through the host
protocol stack like any other network frame. In the fast-path
case, network data is given to the host after the headers
have been processed and stripped.
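
The difference between what the host receives in the two modes can be illustrated with a minimal Python sketch. This is not the INIC implementation; the helper names `build_frame` and `deliver_to_host` are hypothetical, the frame is an untagged Ethernet/IPv4/TCP frame without options, and the checksums are left as zero because they are not the point of the example.

```python
import struct

ETH_HEADER_LEN = 14  # untagged Ethernet header (assumption for this sketch)


def build_frame(payload: bytes) -> bytes:
    """Build a minimal Ethernet/IPv4/TCP frame (checksum fields left as zero)."""
    eth = bytes(6) + bytes(6) + struct.pack("!H", 0x0800)  # dst, src, IPv4 type
    tcp = struct.pack("!HHIIBBHHH", 1025, 139, 0, 0, 5 << 4, 0x18, 8192, 0, 0)
    total_len = 20 + len(tcp) + len(payload)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, total_len, 0, 0, 64, 6, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    return eth + ip + tcp + payload


def deliver_to_host(frame: bytes, fast_path: bool) -> bytes:
    """Return what the host is handed for this frame in each mode."""
    if not fast_path:
        return frame  # slow path: whole frame, all headers intact
    ip_start = ETH_HEADER_LEN
    ip_header_len = (frame[ip_start] & 0x0F) * 4          # IHL field
    total_len = struct.unpack("!H", frame[ip_start + 2:ip_start + 4])[0]
    tcp_start = ip_start + ip_header_len
    tcp_header_len = (frame[tcp_start + 12] >> 4) * 4     # TCP data offset
    # Fast path: headers have been processed and stripped, only data remains.
    return frame[tcp_start + tcp_header_len:ip_start + total_len]


if __name__ == "__main__":
    frame = build_frame(b"SMB data")
    print(len(deliver_to_host(frame, fast_path=False)))
    print(deliver_to_host(frame, fast_path=True))
```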

The transmit case works in much the same fashion. In
slow-path mode the packets are given to the INIC with all of
the headers attached. The INIC simply sends these packets
out as if it were a dumb NIC. In fast-path mode, the host
gives raw data to the INIC which it must carve into MSS
sized segments, add headers to the data, perform checksums
on the segment, and then send it out on the wire.
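
The fast-path transmit steps just described (carve into MSS sized pieces, then checksum each piece) can be sketched as follows. This is an illustration under stated assumptions, not the INIC microcode: the function names are hypothetical, the default MSS of 1460 bytes assumes Ethernet with 20 byte IP and TCP headers, and the checksum shown is the plain 16-bit ones'-complement Internet checksum over the payload only, whereas a real TCP checksum also covers the TCP header and a pseudo-header.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum of the kind used by IP and TCP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def carve_segments(payload: bytes, mss: int = 1460, first_seq: int = 0):
    """Yield (sequence_number, chunk, payload_checksum) for MSS sized pieces."""
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        seq = (first_seq + offset) & 0xFFFFFFFF  # sequence numbers wrap at 2**32
        yield seq, chunk, internet_checksum(chunk)


if __name__ == "__main__":
    for seq, chunk, csum in carve_segments(bytes(3000)):
        print(seq, len(chunk), hex(csum))
```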

Occasionally situations arise for which a TCP connection
being handled by the INIC needs to be returned to the host 
for processing. To accomplish this transfer of responsibility 
for handling a connection we create a communication con- 
trol block (CCB). A CCB is a structure that contains the 
entire context associated with a connection. This includes 
the source and destination IP addresses and source and 
destination TCP ports that define the connection. It also 
contains information about the connection itself such as the 
current send and receive sequence numbers, and the first-hop 
MAC address, etc. The complete set of CCBs exists in host 
memory, but a subset of these may be "owned" by the card 
at any given time. This subset is the CCB cache. The INIC 
can own (cache) up to 256 CCBs at any given time. 
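
As a minimal sketch of the structure just described, a CCB and the INIC's CCB cache might be modelled in Python as below. The field names, class names and method names are hypothetical; only the kinds of information held (addresses, ports, sequence numbers, first-hop MAC address) and the limit of 256 cached CCBs come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

ConnKey = Tuple[str, int, str, int]
MAX_CACHED_CCBS = 256  # limit stated in the text


@dataclass
class CCB:
    """Context for one TCP connection (field names are illustrative)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    snd_seq: int = 0
    rcv_seq: int = 0
    first_hop_mac: str = "00:00:00:00:00:00"
    owned_by_inic: bool = False

    @property
    def key(self) -> ConnKey:
        # The addresses and ports define the connection.
        return (self.src_ip, self.src_port, self.dst_ip, self.dst_port)


class CCBCache:
    """The subset of CCBs currently owned by the card."""

    def __init__(self, capacity: int = MAX_CACHED_CCBS) -> None:
        self.capacity = capacity
        self._owned: Dict[ConnKey, CCB] = {}

    def hand_to_inic(self, ccb: CCB) -> bool:
        """Give a CCB to the card; refuse if the cache is already full."""
        if len(self._owned) >= self.capacity:
            return False
        ccb.owned_by_inic = True
        self._owned[ccb.key] = ccb
        return True

    def flush_to_host(self, key: ConnKey) -> Optional[CCB]:
        """Return ownership of a CCB to the host, if the card holds it."""
        ccb = self._owned.pop(key, None)
        if ccb is not None:
            ccb.owned_by_inic = False
        return ccb

    def lookup(self, key: ConnKey) -> Optional[CCB]:
        return self._owned.get(key)
```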

CCBs are initialized by the host during TCP connection 
setup. Once the connection has achieved a "steady-state" of 
operation, its associated CCB can then be turned over to the 
INIC, putting the connection into fast-path mode. From this 
point on, the INIC owns the connection until either a FIN 
arrives signaling that the connection is being closed, or until 
an exception occurs which the INIC is not designed to 
handle (such as an out of order segment). When any of these
conditions occurs, the INIC will then flush the CCB back to
host memory, and issue a message to the host telling it that 
it has relinquished control of the connection, thus putting the 
connection back into slow-path mode. From this point on, 
the INIC simply hands incoming segments that are destined 
for this CCB off to the host with all of the headers intact. 
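
The ownership hand-off described above behaves like a two-state machine per connection, which the following Python sketch illustrates. The class and method names are hypothetical, and one detail is an assumption made for the example: the segment that triggers the flush (a FIN or an out of order segment) is itself handed to the host.

```python
from enum import Enum


class Path(Enum):
    SLOW = "slow-path"
    FAST = "fast-path"


class Connection:
    """Tracks which side processes segments for one connection."""

    def __init__(self, next_expected_seq: int) -> None:
        self.path = Path.SLOW          # the host sets the connection up
        self.next_expected_seq = next_expected_seq
        self.host_notified = False

    def hand_to_inic(self) -> None:
        """Host turns the CCB over once the connection is in steady state."""
        self.path = Path.FAST
        self.host_notified = False

    def on_segment(self, seq: int, length: int, fin: bool = False) -> str:
        """Return 'inic' or 'host' depending on who processes the segment."""
        if self.path is Path.SLOW:
            return "host"              # headers intact, host stack handles it
        if fin or seq != self.next_expected_seq:
            # Exception the INIC does not handle: flush the CCB back to the
            # host and tell it that control has been relinquished.
            self.path = Path.SLOW
            self.host_notified = True
            return "host"
        self.next_expected_seq = (seq + length) & 0xFFFFFFFF
        return "inic"


if __name__ == "__main__":
    conn = Connection(next_expected_seq=1000)
    conn.hand_to_inic()
    print(conn.on_segment(1000, 100))  # in order
    print(conn.on_segment(1300, 100))  # out of order, connection is flushed
    print(conn.path)
```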

Note that when a connection is owned by the INIC, the
host is not allowed to reference the corresponding CCB in
host memory as it will contain invalid information about the
state of the connection.

When a frame is received by the INIC, it must verify it
completely before it even determines whether it belongs to
one of its CCBs or not. This includes all header validation
(is it IP, IPV4 or V6, is the IP header checksum correct, is
the TCP checksum correct, etc). Once this is done it must
compare the source and destination IP address and the
source and destination TCP port with those in each of its
CCBs to determine if it is associated with one of its CCBs.
This is an expensive process. To expedite this, we have
added several features in hardware to assist us. The header
is fully parsed by hardware and its type is summarized in a
single status word. The checksum is also verified
automatically in hardware, and a hash key is created out of
the IP addresses and TCP ports to expedite CCB lookup. For
full details on these and other hardware optimizations, refer
to the INIC hardware specification sections below.

With the aid of these and other hardware features, much
of the work associated with TCP is done essentially for free.
Since the card will automatically calculate the checksum for
TCP segments, we can pass this on to the host, even when
the segment is for a CCB that the INIC does not own.

By moving TCP processing down to the INIC we have
offloaded the host of a large amount of work. The host no
longer has to pull the data into its cache to calculate the TCP
checksum. It does not have to process the packet headers,
and it does not have to generate TCP ACKs. We have
achieved most of the goals outlined above, but we are not
done yet.

The following paragraphs define the INIC's relation to the
host's transport layer interface, called TDI or Transport
Driver Interface in Windows NT, which is described in detail
further below with regard to the Alacritech TCP (ATCP)
driver.

Simply implementing TCP on the INIC does not allow us
to achieve our goal of landing the data in its final destination.
Somehow the host has to tell the INIC where to put the data.
This is a problem in that the host cannot do this without
knowing what the data actually is. Fortunately, NT has
provided a mechanism by which a transport driver can
"indicate" a small amount of data to a client above it while
telling it that it has more data to come. The client, having
then received enough of the data to know what it is, is then
responsible for allocating a block of memory and passing the
memory address or addresses back down to the transport
driver, which is in turn responsible for moving the data into
the provided location.

We will make use of this feature by providing a small
amount of any received data to the host, with a notification
that we have more data pending. When this small amount of
data is passed up to the client, and it returns with the address
in which to put the remainder of the data, our host transport
driver will pass that address to the INIC which will send the
remainder of the data into its final destination via direct
memory access (DMA).

Clearly there are circumstances in which this does not
make sense. When a small amount of data (500 bytes for
example) arrives with a push flag set indicating that the data
must be delivered to the client immediately, it does not make
sense to deliver some of the data directly while waiting for
the list of addresses to DMA the rest. Under these
circumstances, it makes more sense to deliver the 500 bytes
directly to the host, and allow the host to copy it into its final
destination. While various ranges are feasible, it is currently
preferred that anything less than a segment's (1500 bytes)
worth of data will be delivered directly to the host, while
anything more will be delivered as a small piece (which may
be 128 bytes), while waiting until receiving the destination
memory address before moving the rest.

The trick then is knowing when the data should be
delivered to the client or not. As we've noted, a push flag
indicates that the data should be delivered to the client
immediately, but this alone is not sufficient. Fortunately, in
the case of NetBIOS transactions (such as SMB), we are 
explicitly told the length of the session message in the 
NetBIOS header itself. With this we can simply indicate a 
small amount of data to the host immediately upon receiving
the first segment. The client will then allocate enough
memory for the entire NetBIOS transaction, which we can 
then use to DMA the remainder of the data into as it arrives. 
In the case of a large (56 k for example) NetBIOS session 
message, all but the first couple hundred bytes will be 
DMA'd to their final destination in memory. 
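
How the length in the NetBIOS header lets the receive be planned from the first segment alone can be sketched in Python as follows. This is an illustration under assumptions rather than the INIC's logic: it assumes the 4 byte session service header layout of RFC 1002 (type, flags whose low bit extends the length to 17 bits, 16-bit length), the names `parse_session_header` and `plan_receive` are hypothetical, and the 192 byte indication size simply stands in for the "couple hundred bytes" mentioned above.

```python
import struct


def parse_session_header(segment: bytes):
    """Return (message_type, message_length) from a session service header."""
    msg_type, flags, length = struct.unpack("!BBH", segment[:4])
    # The low-order flag bit is a length extension (17th length bit).
    total = ((flags & 0x01) << 16) | length
    return msg_type, total


def plan_receive(first_segment: bytes, indicate_len: int = 192):
    """Decide how much to indicate to the host and how much to DMA later."""
    _, total = parse_session_header(first_segment)
    available = len(first_segment) - 4      # data bytes in the first segment
    indicated = min(indicate_len, available, total)
    return {"total": total, "indicated": indicated, "dma": total - indicated}


if __name__ == "__main__":
    header = struct.pack("!BBH", 0x00, 0x00, 56 * 1024)
    first = header + bytes(1456)
    print(plan_receive(first))
```

For the 56 k session message used as the example in the text, the client learns the full length after the first indication and all but the indicated bytes are left for DMA.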

But what about applications that do not reside above
NetBIOS? In this case we cannot rely on a session level
protocol to tell us the length of the transaction. Under these
circumstances we will buffer the data as it arrives until 1)
we have received some predetermined number of bytes such
as 8 k, or 2) some predetermined period of time passes
between segments, or 3) we get a push flag. After any of
these conditions occurs, we will then indicate some or all of
the data to the host depending on the amount of data
buffered. If the data buffered is greater than about 1500 bytes
we must then also wait for the memory address to be
returned from the host so that we may then DMA the
remainder of the data.

The transmit case is much simpler. In this case the client
(NetBIOS for example) issues a TDI Send with a list of
memory addresses which contain data that it wishes to send
along with the length. The host can then pass this list of
addresses and length off to the INIC. The INIC will then pull
the data from its source location in host memory, as it needs
it, until the complete TDI request is satisfied.

Note that when we receive a large SMB transaction, for
example, there are two interactions between the INIC
and the host: the first in which the INIC indicates a small
amount of the transaction to the host, and the second in
which the host provides the memory location(s) in which the
INIC places the remainder of the data. This results in only
two interrupts from the INIC: the first when it indicates the
small amount of data and the second after it has finished
filling in the host memory given to it. Note the drastic
reduction from the 33 interrupts per 64 k SMB request
generated by a conventional NIC that was mentioned in the
background. On transmit, we actually only receive a single
interrupt when the send command that has been given to the
INIC completes.

Having now established our interaction with Microsoft's
TDI interface, we have achieved our goal of landing most of
our data directly into its final destination in host memory. We
have also managed to transmit all data from its original
location on host memory. And finally, we have reduced our
interrupts to two per 64 k SMB read and one per 64 k SMB
write. The only thing that remains in our list of objectives is
to design an efficient host (PCI) interface.

One of our primary objectives in designing the host
interface of the INIC was to eliminate PCI reads in either
direction. PCI reads are particularly inefficient in that they
completely stall the reader until the transaction completes.
As we noted above, this could hold a CPU up for several
microseconds, a thousand times the time typically required
to execute a single instruction. PCI writes on the other hand,
are usually buffered by the memory-bus/PCI bridge,
allowing the writer to continue on with other instructions.
This technique is known as "posting".

The only PCI read that is required by many conventional
NICs is the read of the interrupt status register. This register
gives the host CPU information about what event has caused
an interrupt (if any). In the design of our INIC we have
elected to place this necessary status register into host
memory. Thus, when an event occurs on the INIC, it writes
the status register to an agreed upon location in host
memory. The corresponding driver on the host reads this
local register to determine the cause of the interrupt. The
interrupt lines are held high until the host clears the interrupt
by writing to the INIC's Interrupt Clear Register. Shadow
registers are maintained on the INIC to ensure that events are
not lost.

Since it is imperative that our INIC operate as efficiently
as possible, we must also avoid PCI reads from the INIC. We
do this by pushing our receive buffer addresses to the INIC.
As mentioned at the beginning of this section, most NICs
work on a descriptor queue algorithm in which the NIC
reads a descriptor from main memory in order to determine
where to place the next frame. We will instead write receive
buffer addresses to the INIC as receive buffers are filled. In
order to avoid having to write to the INIC for every receive
frame, we instead allow the host to pass off a page's worth
(4 k) of buffers in a single write.

In order to reduce further the number of writes to the
INIC, and to reduce the amount of memory being used by
the host, we support two different buffer sizes. A small buffer
contains roughly 200 bytes of data payload, as well as extra
fields containing status about the received data, bringing the
total size to 256 bytes. We can therefore pass 16 of these
small buffers at a time to the INIC. Large buffers are 2 k in
size. They are used to contain any fast or slow-path data that
does not fit in a small buffer. Note that when we have a large
fast-path receive, a small buffer will be used to indicate a
small piece of the data, while the remainder of the data will
be DMA'd directly into memory. Large buffers are never
passed to the host by themselves; instead they are always
accompanied by a small buffer which contains status about
the receive along with the large buffer address. By operating
in this manner, the driver must only maintain and process the
small buffer queue. Large buffers are returned to the host by
being attached to small buffers. Since large buffers are 2 k in
size they are passed to the INIC 2 buffers at a time.

In addition to needing a manner by which the INIC can
pass incoming data to us, we also need a manner by which
we can instruct the INIC to send data. Plus, when the INIC
indicates a small amount of data in a large fast-path receive,
we need a method of passing back the address or addresses
in which to put the remainder of the data. We accomplish
both of these with the use of a command buffer. Sadly, the
command buffer is the only place in which we must violate
our rule of only pushing data across PCI. For the command
buffer, we write the address of the command buffer to the
INIC. The INIC then reads the contents of the command
buffer into its memory so that it can execute the desired
command. Since a command may take a relatively long time
to complete, it is unlikely that command buffers will
complete in order. For this reason we also maintain a
response buffer queue. Like the small and large receive
buffers, a page worth of response buffers is passed to the
INIC at a time. Response buffers are only 32 bytes, so we
have to replenish the INIC's supply of them relatively
infrequently. The response buffer's only purpose is to
indicate the completion of the designated command buffer,
and to pass status about the completion.

The following examples describe some of the differing
data flows that we might see on the INIC. For the first
example of a fast-path receive, assume a 56 k NetBIOS
session message is received on the INIC. The first segment
will contain the NetBIOS header, which contains the total
NetBIOS length. A small chunk of this first segment is
provided to the host by filling in a small receive buffer,
modifying the interrupt status register on the host, and
raising the appropriate interrupt line. Upon receiving the
interrupt, the host will read the ISR, clear it by writing back
to the INIC's Interrupt Clear Register, and will then process
its small receive buffer queue looking for receive buffers to
be processed. Upon finding the small buffer, it will indicate
the small amount of data up to the client to be processed by
NetBIOS. It will also, if necessary, replenish the receive
buffer pool on the INIC by passing off a page worth of small
buffers. Meanwhile, the NetBIOS client will allocate a
memory pool large enough to hold the entire NetBIOS
message, and will pass this address or set of addresses down
to the transport driver. The transport driver will allocate an
INIC command buffer, fill it in with the list of addresses, set
the command type to tell the INIC that this is where to put
the receive data, and then pass the command off to the INIC
by writing to the command register. When the INIC receives
the command buffer, it will DMA the remainder of the
NetBIOS data, as it is received, into the memory address or
addresses designated by the host. Once the entire NetBIOS
transaction is complete, the INIC will complete the command
by writing to the response buffer with the appropriate
status and command buffer identifier.

In this example, we have two interrupts, and all but a
couple hundred bytes are DMA'd directly to their final
destination. On PCI we have two interrupt status register
writes, two interrupt clear register writes, a command register
write, a command read, and a response buffer write. In
contrast, a host having a conventional NIC would experience
an estimated 30 interrupts, 30 interrupt register reads,
30 interrupt clear writes, and 58 descriptor reads and writes!
Moreover, the data may be moved anywhere from four to
eight times across the system memory bus.

For the second example, a slow-path receive, assume the
INIC receives a frame that does not contain a TCP segment
for one of its CCBs. In this case the INIC simply passes the
frame to the host as if it were a dumb NIC, according to the
slow-path. If the frame fits into a small buffer (~200 bytes or
less), then it simply fills in the small buffer with the data and
notifies the host. Otherwise it places the data in a large
buffer, writes the address of the large buffer into a small
buffer, and again notifies the host. The host, having received
the interrupt and found the completed small buffer, checks to
see if the data is contained in the small buffer, and if not,
locates the large buffer. Having found the data, the host will
then pass the frame upstream to be processed by the standard
protocol stack. It must also replenish the INIC's small and
large receive buffer pool if necessary.

With the INIC, this slow-path results in one interrupt, one
interrupt status register write and one interrupt clear register
write, as well as a possible small and/or large receive buffer
register write. The data will go through the normal path,
although if it is TCP data then the host will not have to
perform the checksum. A conventional NIC handling this
frame will cause a single interrupt, an interrupt status
register read, an interrupt clear register write, and a descriptor
read and write. The data will get processed as it would
by the INIC, except for a possible extra checksum. Thus the
slow-path receive mode is much like conventional, except
for hardware assists with items such as checksums.

For the third example, a fast-path send, assume that the
client has a small amount of data to send. It will issue the
TDI Send to the transport driver, which will allocate a
command buffer, fill it in with the address of the 400 byte
send, and set the command type to indicate that it is a transmit.
It will then pass the command off to the INIC by writing to
the command register. The INIC will then DMA the 400
bytes into its own memory, prepare a frame with the
appropriate checksums and headers, and send the frame out
on the wire. After it has received the acknowledgement it
will then notify the host of the completion by writing to a
response buffer.

With the INIC, this will result in one interrupt, one
interrupt status register write, one interrupt clear register
write, a command buffer register write, a command buffer
read, and a response buffer write. The data is DMA'd
directly from the system memory.

In contrast, a standard NIC would instead generate an
interrupt, an interrupt status register read, an interrupt clear
register write, and a descriptor read and write. The data
would get moved across the system bus a minimum of four
times. The resulting TCP ACK of the data, however, would
add yet another interrupt, another interrupt status register
read, interrupt clear register write, a descriptor read and
write, and yet more processing by the host protocol stack.
These examples illustrate the dramatic differences between
fast-path network message processing and conventional.

Achievements of the Alacritech INIC include not only
processing network data through TCP, but also providing
zero-copy support for the SMB upper-layer protocol. It
accomplishes this in part by supporting two paths for
sending and receiving data, a fast-path and a slow-path. The
fast-path data flow corresponds to connections that are
maintained on the INIC, while slow-path traffic corresponds
to network data for which the INIC does not have a
connection. The fast-path flow includes passing a header to
the host and subsequently holding further data for that
connection on the card until the host responds via an INIC
command with a set of buffers into which to place the
accumulated data. In the slow-path data flow, the INIC will
be operating as a "dumb" NIC, so that these packets are
simply dumped into frame buffers on the host as they arrive.

In order to support both fast and slow paths, a novel host
interface strategy is employed. Note that with the INIC we
have some challenges that are not found with conventional
NIC designs. A typical NIC has a transmit and a receive ring
of buffer descriptors. When the NIC receives a frame, it
grabs a descriptor off of the receive queue, if one is
available, locates a buffer address specified within the
receive descriptor, and moves the receive frame to that
address. After the data has been moved, the descriptor is
updated with status indicating that a frame has been
received, and the driver is notified via a write to the interrupt
status register followed by an interrupt. In this environment,
the driver will typically replace the now filled-in buffer on
the receive queue with a new free buffer.

Similarly, in a typical NIC, when the driver wishes to send
a frame, it fills in a descriptor on the transmit queue with the
address and length of data to be transmitted and writes to a
register on the NIC telling it that there is at least one pending
transmit. The NIC de-queues the now valid transmit
descriptor, locates the data address and length, and sends the
frame out on the wire. Upon completion it will notify the
driver (via an ISR/interrupt) that the frame has been sent, at
which point the driver can free the memory containing the
send frame.

Our first challenge comes from the fact that in
our design, transmits can complete out of order. For
example, since our card offloads TCP processing from the
host CPU, it is capable of transmitting a 64 k SMB write in
a single command. On the INIC itself, this 64 k transmit is
broken down into many ethernet frames in accordance with
the TCP maximum segment size (MSS). Because the TCP
window size is typically about 8 k, we cannot send the 64
k in a single block of frames. Instead the INIC will have to
go through many TCP send/acknowledgment phases before
the entire 64 k has been sent. While this is going on, the host
may also issue a command to send 256 bytes. This, of
course, will complete much sooner than the 64 k send
request. These out-of-order send completions will not work
with the typical transmit queue design because there is no
way for the driver to know which resources it can free when
it gets a transmit completion interrupt. To resolve this we
introduce a command/response handshake between the
driver and the INIC. The driver maintains a queue of
response buffers. When it wishes to send data it fills in a
command (like a transmit descriptor) and writes the physical
address of the command to the INIC. It also sends a handle
with the command. When the INIC completes the request, it writes
the handle back to the response queue of the driver. The
driver uses this handle to locate the original command buffer
so it can free the send resources.

For receiving messages we have abandoned the conventional
receive descriptor queue for performance reasons.
Small transactions on PCI can severely limit PCI bus
throughput. In the typical receive descriptor environment a
NIC must first read a descriptor (typically 16-32 bytes)
across PCI to get the buffer information. It then moves the
data across PCI into the buffer, and then writes status back
into the receive descriptor. One objective of our receive
design was to eliminate the first descriptor read. Thus we
needed an efficient way in which to pass receive buffer
addresses to the INIC. We accomplished this by passing a
block of receive buffers to the INIC at one time. In the driver
we allocate a block of contiguous memory (typically a page,
which is typically 4 k). We write the address of that block to
the INIC with the bottom bits of the address specifying the
number of buffers in the block. In order to receive 1514 byte
frames (maximum ether frame size), however, we can only
fit two buffers in a 4 k page, which is not a substantial
savings. Fortunately, network frames tend to be either large
(~1500 bytes), or small (<256 bytes).

We take advantage of this fact by allocating large and
small receive buffers. If a received frame fits in a small
buffer, the INIC will use a small buffer. Otherwise it will use
a large buffer. A problem with that system then is preserving
receive order. If we were to maintain a small and a large
buffer queue, there would be no way to know in which order
two frames, one small and one large, were received. A
solution is to maintain a single receive queue of small
buffers. The host passes the small buffers in blocks of 16 at
a time to the INIC, and they are guaranteed to be returned
to us in the order in which they were given to the INIC. The
small buffer contains status about the receive as well as
small frames. If a received frame does not fit in the small
buffer, then we allocate a large buffer and place a pointer to
that large buffer in the small buffer. Thus, large buffers are
only returned to the driver when attached to small buffers.

As shown in FIG. 2, the fast-path flow puts a header such
as HEADER A 90 into a header buffer that is then forwarded
to the host. HEADER A contains status 92 that has been
generated by the INIC and TCP/SMB headers 94 that can be
used by the host to determine what further data is following
and allocate the necessary host buffers, which are then
passed back to the INIC as data buffer descriptors 96 via a
command to the INIC. The INIC then fills these buffers from
data it was accumulating on the card and notifies the host by
sending a response to the command. Alternatively, the
fast-path may receive a header and data that is a complete
request, but that is also too large for a header buffer. This
results in a header and data buffer being passed to the host.
This latter flow is similar to the slow-path flow of HEADER
B 98, which also puts all the data into the header buffer or,
if the header buffer is too small, uses a large (2K) host buffer
for all the data. This means that on the unsolicited receive
path, the host will only see either a header buffer or a header
and, at most, one data buffer. Note that data is never split
between a header and a data buffer.

The order in which data is written is important. Data
buffers are moved by DMA into the host before the header
buffer, since the header buffer contains the status word
designating that the data has arrived. Header buffers in host
memory are 256 bytes long, and are aligned on 256 byte
boundaries. There will be a field in the header buffer
indicating it has valid data. This field will initially be reset
by the host before passing the buffer to the INIC.
A set of header buffers are passed from the host to the INIC
by the host writing to the Header Buffer Address Register on
the INIC. This register is defined as follows:

Bits 31-8 Physical address in host memory of the first of
a set of contiguous header buffers.

Bits 7-0 Number of header buffers passed.

In this way the host can, say, allocate 16 buffers in a 4K
page, and pass all 16 buffers to the INIC with one register
write. For each interface, the INIC will maintain a queue of
these header descriptors in the SmallHType queue in its own
local memory, adding to the end of the queue every time the
host writes to one of the Header Buffer Address Registers.
Note that the single entry is added to the queue; the eventual
dequeuer will use the count after extracting that entry.

The header buffers will be used and returned to the host
in the same order that they were given to the INIC. The valid
field will be set by the INIC before returning the buffer to the
host. In this way a PCI interrupt, with a single bit in the
interrupt register, may be generated to indicate that there is
a header buffer for the host to process. When servicing this
interrupt, the host will look at its queue of header buffers,
reading the valid field to determine how many header buffers
are to be processed.

Receive data buffers are allocated in blocks of two, 2 k
bytes each (4 k page). In order to pass receive data buffers
to the INIC, the host must write two values to the INIC. The
first value to be written is the Data Buffer Handle. The buffer
handle is not significant to the INIC, but will be copied back
to the host to return the buffer to the host. The second value
written is the Data Buffer Address. This is the physical
address of the data buffer. When both values have been
written, the INIC will add these values to the FreeType queue of
data buffer descriptors. The INIC will extract two entries
each time when dequeuing.

Data buffers will be allocated and used by the INIC as
needed. For each data buffer used, the data buffer handle will
be copied into a header buffer. Then the header buffer will
be returned to the host.

A transmit interface is shown in FIG. 3. The transmit
interface, like the receive interface, has been designed to
minimize the amount of PCI bandwidth and latencies. In
order to transmit data, the host transfers a command pointer
110 to the INIC. This command pointer includes a command
buffer handle 112, a command field 113, possibly a TCP
context identification 114, and a list of physical data pointers
116. The command buffer handle is defined to be the first
word of the command buffer and is used by the host to
identify the command. This word is passed back to the host
in a response buffer queue, since commands may complete
out of order as depicted by crossed arrows 118 and 120, and
the host needs to know which command is complete. Commands
can be used for many reasons, but primarily cause the
INIC to transmit data, or to pass a set of buffers to the INIC
for input data on the fast-path as previously discussed.

Response buffers are physical buffers in host memory and
contain status 122 regarding the command as well as the
command buffer handle. They are used by the INIC in the
same order as they were given to the INIC by the host. This
enables the host to know which response buffer(s) to next
look at when the INIC signals a command completion.

Command buffers in host memory are a multiple of 32
bytes, up to a maximum of 1K bytes, and are aligned on 32
byte boundaries. A command buffer is passed to the INIC by
writing to the Command Buffer Address Register for a given 
interface. This register is defined as follows: 

Bits 31-5 Physical address in host memory of the com- 
mand buffer. 

Bits 4-0 Length of command buffer in bytes/32 (i.e. 
number of multiples of 32 bytes) 

This is the physical address of the command buffer. For 
each interface we have a transmit command register and a 
receive command register. When one of these registers has 
been written, the INIC will add the contents of the register 
to its own internal queue of command buffer descriptors.
The first word of all command buffers is defined to be the 
command buffer handle. It is the job of the utility processor 
to extract a command from its local queue, DMA the 
command into a small INIC buffer (from the FreeSType 
queue), and queue that buffer into the Xmit#Type queue, 
where # is 0-3 depending on the interface, or the appropriate 
RCV queue. The receiving processor will service the queues 
to perform the commands. When that processor has com- 
pleted a command, it extracts the command buffer handle 
and passes it back to the host via a response buffer. 
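
The register layout above can be sketched in C as follows. This is a minimal illustration, not the driver's actual code: the function names are hypothetical, and the text does not say how a full 1K buffer (32 units of 32 bytes) is represented in a 5-bit field, so the sketch assumes lengths of at most 31 units.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: build a Command Buffer Address Register value.
 * Bits 31-5 carry the physical address, bits 4-0 carry length / 32.
 */
uint32_t ATKEncodeCommandReg(uint32_t PhysAddr, uint32_t LengthBytes)
{
    uint32_t Units = LengthBytes / 32u;

    assert((PhysAddr & 0x1Fu) == 0u);       /* aligned on 32 bytes */
    assert((LengthBytes % 32u) == 0u);      /* multiple of 32 bytes */
    assert(Units >= 1u && Units <= 31u);    /* assumption: fits in 5 bits */

    return PhysAddr | Units;
}

/* Recover the physical address from a register value. */
uint32_t ATKCommandRegAddress(uint32_t Reg)
{
    return Reg & ~(uint32_t)0x1Fu;
}

/* Recover the command buffer length in bytes from a register value. */
uint32_t ATKCommandRegLength(uint32_t Reg)
{
    return (Reg & 0x1Fu) * 32u;
}
```

Because the alignment requirement frees the low five address bits, the address and the length travel in a single posted PCI write.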

Response buffers in host memory are 32 bytes long and 
aligned on 32 byte boundaries. They are handled in a very 
similar fashion to header buffers. There is a field in the 
response buffer indicating it has valid data. This field is 
initially reset by the host before passing the buffer descriptor 
to the INIC. A set of response buffers are passed from the 
host to the INIC by the host writing to the Response Buffer 
Address Register on the INIC. This register is defined as 
follows: 

Bits 31-8 Physical address in host memory of the first of 
a set of contiguous response buffers 

Bits 7-0 Number of response buffers passed. 

In this way the host can, say, allocate 128 buffers in a 4K 
page, and pass all 128 buffers to the INIC with one register 
write. The INIC maintains a queue of these header descrip- 
tors in its ResponseType queue for each interface, adding to 
the end of the queue every time the host writes to the 
Response Buffer Address Register. The INIC writes the 
extracted contents including the count, to the queue in 
exactly the same manner as for the header buffers. 
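
As a rough sketch of the arithmetic, the following C fragment packs a buffer-set register value and checks the buffers-per-page figure. The names are hypothetical. One inference is labeled in the code: since only bits 31-8 hold the address, the first buffer of a set is assumed to start on a 256-byte boundary, which a page-aligned block always satisfies.

```c
#include <assert.h>
#include <stdint.h>

#define ATK_PAGE_SIZE      4096u
#define ATK_RESPONSE_SIZE  32u    /* response buffers are 32 bytes long */

/*
 * Hypothetical helper: bits 31-8 hold the physical address of the first
 * buffer in a contiguous set, bits 7-0 hold the number of buffers.
 */
uint32_t ATKEncodeBufferSetReg(uint32_t PhysAddr, uint32_t Count)
{
    assert((PhysAddr & 0xFFu) == 0u);        /* inferred 256-byte alignment */
    assert(Count >= 1u && Count <= 0xFFu);   /* count must fit in 8 bits */

    return PhysAddr | Count;
}

/* Number of response buffers that fit in one page. */
uint32_t ATKResponseBuffersPerPage(void)
{
    return ATK_PAGE_SIZE / ATK_RESPONSE_SIZE;
}
```

The same encoding applies to a page of sixteen 256-byte header buffers, where the count field would be 16.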

The response buffers are used and returned to the host in 
the same order that they were given to the INIC. The valid 
field is set by the INIC before returning the buffer to the host. 
In this way a PCI interrupt, with a single bit in the interrupt 
register, may be generated to indicate that there is a response 
buffer for the host to process. When servicing this interrupt, 
the host will look at its queue of response buffers, reading 
the valid field to determine how many response buffers are 
to be processed. 
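
The host-side servicing loop might look like the sketch below. The structure layout and all names are assumptions made for illustration; the text names only the valid field, the status, and the command buffer handle. A real driver would also need volatile accesses or memory barriers around fields written by DMA, which are omitted here.

```c
#include <stdint.h>

#define ATK_RESPONSE_COUNT 128u   /* one 4K page of 32-byte buffers */

/* Hypothetical 32-byte response buffer layout. */
typedef struct _ATK_RESPONSE {
    uint32_t Valid;         /* reset by the host, set by the INIC */
    uint32_t Status;        /* status of the completed command */
    uint32_t Handle;        /* first word of the completed command buffer */
    uint32_t Reserved[5];   /* pad to 32 bytes */
} ATK_RESPONSE, *PATK_RESPONSE;

/*
 * Walk the queue in the order the buffers were handed to the INIC,
 * starting at *Next, and stop at the first buffer that is not valid.
 * Completed command handles are copied to Handles; the return value
 * is the number of responses processed.
 */
uint32_t ATKCollectResponses(
    PATK_RESPONSE Queue,
    uint32_t *Next,
    uint32_t *Handles,
    uint32_t MaxHandles
    )
{
    uint32_t Count = 0u;

    while (Count < MaxHandles && Queue[*Next].Valid != 0u) {
        Handles[Count++] = Queue[*Next].Handle;
        Queue[*Next].Valid = 0u;    /* reset before the buffer is reused */
        *Next = (*Next + 1u) % ATK_RESPONSE_COUNT;
    }
    return Count;
}
```

The handles may arrive in any order relative to the order in which commands were issued; only the response buffers themselves are consumed in order, which is why the valid field is enough to find the next one.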

FIG. 4 shows an initial format of a thirty-two bit interrupt
status register (ISR) of the present invention. Bit thirty-one
(ERR-b31) is for setting error bits, bit thirty (RCV-b30)
denotes whether a receive has occurred, bit twenty-nine
(CMD-b29) denotes whether a command has occurred,
while bit twenty-five (RMISS-b25) denotes whether a
receive has occurred.

It is designed that the setting of any bits in the ISR will 
cause an interrupt, provided the corresponding bit in an 
Interrupt Mask Register is set. The default setting for the 
IMR is 0. 
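
The masking rule can be written out in a few lines of C. The macro and function names are hypothetical; only the bit positions and the rule that an ISR bit interrupts when the matching IMR bit is set come from the text.

```c
#include <stdint.h>

/* ISR bit positions as described for FIG. 4 (names are illustrative). */
#define ATK_ISR_ERR    (1u << 31)
#define ATK_ISR_RCV    (1u << 30)
#define ATK_ISR_CMD    (1u << 29)
#define ATK_ISR_RMISS  (1u << 25)

/*
 * An interrupt is raised when any ISR bit is set whose corresponding
 * bit in the Interrupt Mask Register is also set.
 */
int ATKInterruptPending(uint32_t Isr, uint32_t Imr)
{
    return (Isr & Imr) != 0u;
}
```

With the default IMR of 0, this function returns 0 for every ISR value, so no event interrupts the host until the driver writes a mask.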

It is also designed that the host should never need to 
directly read the ISR from the INIC. To support this, it is 
important for the host/INIC to arrange a buffer area in host 
memory into which the ISR is dumped. To accomplish this,
the driver will write the location of the memory-based ISR 
to the Interrupt Status Pointer Register on the INIC. 

For the host to never have to actually read the register 
from the INIC itself, it is important for the INIC to update 
this host copy of the register whenever anything in it 
changes. The host will Ack (or deassert) events in the 
register by writing to the register with 0's in appropriate bit 
fields. So that the host does not miss events, the following 
scheme is employed: 

The INIC keeps a local copy of the register whenever the 
INIC DMAs it to the host after some event(s). This is termed
COPYA. Then the INIC starts accumulating any new events 
not reflected in the host copy in a separate word. This is 
called NEWA. As the host clears bits by writing the register 
back with those bits set to zero, the INIC clears these bits in 
COPYA (or the host write-back goes directly to COPYA). If
there are new events in NEWA, it ORs them with COPYA, 
and DMAs this new ISR to the host. This new ISR then 
replaces COPYA, NEWA is cleared and the cycle then 
repeats. 
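
The COPYA/NEWA cycle can be modeled with a small C sketch. All names are hypothetical, the DMA to host memory is stood in for by a plain field, and one timing detail is an assumption: when the host has nothing outstanding (COPYA is zero), a new event is pushed immediately rather than held.

```c
#include <stdint.h>

/* Hypothetical model of the INIC-side ISR shadow state. */
typedef struct _ATK_ISR_SHADOW {
    uint32_t CopyA;     /* what the host copy currently holds */
    uint32_t NewA;      /* events not yet reflected in the host copy */
    uint32_t HostIsr;   /* stands in for the ISR word in host memory */
} ATK_ISR_SHADOW, *PATK_ISR_SHADOW;

/* OR the new events into COPYA, "DMA" the result, and clear NEWA. */
void ATKIsrPush(PATK_ISR_SHADOW Shadow)
{
    Shadow->CopyA |= Shadow->NewA;
    Shadow->NewA = 0u;
    Shadow->HostIsr = Shadow->CopyA;
}

/* Record an event on the INIC. */
void ATKIsrEvent(PATK_ISR_SHADOW Shadow, uint32_t Bits)
{
    Shadow->NewA |= Bits;
    if (Shadow->CopyA == 0u) {      /* assumption: push when host is idle */
        ATKIsrPush(Shadow);
    }
}

/* The host writes the register back with acknowledged bits set to zero. */
void ATKIsrHostWriteBack(PATK_ISR_SHADOW Shadow, uint32_t Value)
{
    Shadow->CopyA &= Value;         /* clear the acknowledged bits */
    if (Shadow->NewA != 0u) {
        ATKIsrPush(Shadow);         /* deliver events that arrived meanwhile */
    }
}
```

Keeping new events in a separate word means a host write-back can never erase an event that the host has not yet seen.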

Table 1 lists the INIC register addresses. For the sake of 
simplicity, the registers are in 4-byte increments from what- 
ever the TBD base address is. 

TABLE 1

Register   Offset   Description
ISP        0x0      Interrupt Status Pointer (0-3)
ISR        0x10     Interrupt Status Response (0-3)
IMR        0x20     Interrupt Mask (0-3)
HBAR       0x30     Header Buffer Address (0-3)
DBAR       0x40     Data Buffer Address (and Handle) (0-3)
CBAR       0x50     Command Buffer Address XMT (0-3)
RBAR       0x60     Response Buffer Address (0-3)
RCBAR      0x70     Receive Command Buffer Address

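
One way to read Table 1 is sketched below in C. The enum and function names are hypothetical, and the layout is an assumption: each register group is taken to hold one 4-byte register per interface (0-3), which matches the 0x10 spacing between groups. The table does not list an interface range for RCBAR, so the sketch does not rely on one.

```c
#include <assert.h>
#include <stdint.h>

/* Register group offsets from Table 1, relative to the (TBD) base. */
typedef enum _ATK_REG {
    ATK_REG_ISP   = 0x00,   /* Interrupt Status Pointer */
    ATK_REG_ISR   = 0x10,   /* Interrupt Status Response */
    ATK_REG_IMR   = 0x20,   /* Interrupt Mask */
    ATK_REG_HBAR  = 0x30,   /* Header Buffer Address */
    ATK_REG_DBAR  = 0x40,   /* Data Buffer Address (and Handle) */
    ATK_REG_CBAR  = 0x50,   /* Command Buffer Address XMT */
    ATK_REG_RBAR  = 0x60,   /* Response Buffer Address */
    ATK_REG_RCBAR = 0x70    /* Receive Command Buffer Address */
} ATK_REG;

/* Assumed layout: one 4-byte register per interface within a group. */
uint32_t ATKRegOffset(ATK_REG Reg, uint32_t Interface)
{
    assert(Interface <= 3u);
    return (uint32_t)Reg + 4u * Interface;
}
```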
In order to coordinate operation of the INIC with a host
computer, we have designed an Alacritech TCP (ATCP) 
transport driver. The ATCP driver runs on the host and 
consists of three main components. The bulk of the protocol 
stack is based on the FreeBSD TCP/IP protocol stack. This 
code performs the Ethernet, ARP, IP, ICMP, and (slow path) 
TCP processing for the driver. At the top of the protocol 
stack we introduce an NT filter driver used to intercept TDI
requests destined for the Microsoft TCP driver. At the 
bottom of the protocol stack we include an NDIS protocol- 
driver interface which allows us to communicate with the 
INIC miniport NDIS driver beneath the ATCP driver. 

In order to ensure that our ATCP driver is written in a 
consistent manner, we have adopted a set of coding guide- 
lines. These proposed guidelines were introduced with the 
philosophy that we should write code in a Microsoft style 
since we are introducing an NT-based product. The guide- 
lines below apply to all code that we introduced into our 
driver. Since a very large portion of our ATCP driver is based 
on FreeBSD, and since we were somewhat time-constrained 
in our driver development, the ported FreeBSD code is 
exempt from these guidelines. 

Guidelines 

Global symbols — All function names and global variables 
in the ATCP driver begin with the "ATK" prefix (ATKSend()
for instance). 

We use the #define ALACRITECH to identify those 
sections of code which must be conditionally compiled (or 
not compiled) in the ATCP as opposed to BSD environment. 

Variable names — Microsoft seems to use capital letters to 
separate multi-word variable names instead of underscores 
(VariableName instead of variable_name). We adhere to 
this style. 
Structure pointers — Microsoft typedefs all of their structures.
The structure types are always capitals and they
typedef a pointer to the structure as "P" <name> as follows:

    typedef struct _FOO {
        INT bar;
    } FOO, *PFOO;

We adhere to this style.

Function calls — Microsoft separates function call arguments
on separate lines:

    X = foobar(
        argument1,
        argument2
        );

We adhere to this style.

Comments — While Microsoft seems to alternatively use
// and /* */ comment notation, we exclusively use the /* */
notation.

Function comments — Microsoft includes comments with
each function that describe the function, its arguments and
its return value. We also include these comments, but move
them from within the function itself to just prior to the
function for better readability.

Function arguments — Microsoft includes the keywords
IN and OUT when defining function arguments. These
keywords denote whether the function argument is used as
an input parameter, or alternatively as a placeholder for an
output parameter. We include these keywords.

Function prototypes — as far as possible we collect all new
function prototypes in a single file: atcp_prototypes.h. The
practice of proliferating a header file for every ".c" file is
avoided. Prototypes of existing BSD functions are left in
their current header files, however, to minimize differences
between our code and the BSD base.

Indentation — Microsoft code fairly consistently uses a
tabstop of 4. We adhere to this style.

Header file #ifndef — each header file should contain a
#ifndef/#define/#endif which is used to prevent recursive
header file includes. For example, foo.h would include:

    #ifndef _FOO_H_
    #define _FOO_H_
    <foo.h contents...>
    #endif /* _FOO_H_ */

Note the _NAME_H_ format.

Each file must contain a comment at the beginning which
includes the $Id$ as follows:

    /*
     * $Id$
     */

CVS (RCS) will expand this keyword to denote RCS
revision, timestamps, author, etc.

The next few paragraphs describe a configuration
designed to make the ATCP driver SMP safe. The basic rule
for SMP kernel code is that any access to a memory variable
must be protected by a lock, which prevents a competing
access by code running on another processor. Spinlocks are
the normal locking method for code paths that do not take a
long time to execute (and which do not sleep).

In general each instance of a structure includes a spinlock,
which must be acquired before members of that structure are
accessed, and held while a function is accessing that instance
of the structure. Structures which are logically grouped
together may be protected by a single spinlock: for example,
the 'in_pcb' structure, 'tcpcb' structure, and 'socket' structure
which together constitute the administrative information
for a TCP connection will be collectively managed by
a single spinlock in the corresponding connection object.

In addition, every global data structure such as a list or
hash table also has a protecting spinlock which must be held
while the structure is being accessed or modified. The NT
DDK in fact provides a number of convenient primitives for
SMP-safe list manipulation, and these are used for any new
lists. Existing list manipulations in the FreeBSD code will be
left as-is to minimize code disturbance, except of course that
the necessary spinlock acquisition and release must be added
around them.

Spinlocks should not be held for long periods of time, and
most especially, must not be held during a sleep, since this
will lead to deadlocks. There is a significant deficiency in the
NT kernel support for SMP systems: it does not provide an
operation which allows a spinlock to be exchanged atomically
for a sleep lock. This would be a serious problem in a
UNIX environment where much of the processing occurs in
the context of the user process which initiated the operation.
The spinlock would have to be explicitly released, followed
by a separate acquisition of the sleep lock: creating an unsafe
window.

The NT approach is more asynchronous, however: IRPs
are simply marked as 'PENDING' when an operation cannot
be completed immediately. The calling thread does NOT
sleep at that point: it returns, and may go on with other
processing. Pending IRPs are later completed, not by waking
up the thread which initiated them, but by an 'IoCompleteRequest'
call which typically runs at DISPATCH level in
an arbitrary context.

Thus we have not in fact used sleep locks anywhere
in the design of the ATCP driver, so hopefully the above
issue will not arise.

As described above, the ATCP driver supports two paths
for sending and receiving data, the fast-path and the slow-path.
The fast-path data flow corresponds to connections that
are maintained on the INIC, while slow-path traffic corresponds
to network data for which the INIC does not have a
connection. Note that in fast-path, all of the data that flows
between host and INIC is "pure payload": all protocol
processing is done on the INIC. In slow-path, however, the
INIC is operating as a conventional "dumb NIC", and the
packets passing between ATCP driver and INIC contain all
the header info from MAC layer on up.

For a first implementation, we divide network communication
into NETBIOS traffic, which is identifiable by port
number, and everything else.

For NETBIOS input, as soon as the INIC has received a
segment containing a NETBIOS header, it will forward it up
to the TCP driver, along with the NETBIOS length from the
header. Alternatively the host can acquire this information
from the header itself, but since the INIC has already done
the decode, it seems reasonable to just pass it.

From the TDI spec, the amount of data in the buffer
actually sent must be at least 128 bytes. In fact, we have
room for 192 bytes in our "small" buffers; and experiments
show that, to make the NETBT code respond correctly, it is
necessary to pass more than 128 bytes. So for a full segment
which starts with a NETBIOS header, we pass a "header" of
192 bytes, together with the actual NETBIOS length, which
will be indicated up as the "available" length. For segments
less than a full 1460 byte payload, all of the received
segment will be forwarded; it will be absorbed directly by
the TDI client without any further memory descriptor list
(MDL) exchange. Experiments tracing the TDI data flow
show that the NETBT client directly absorbs up to 1460
bytes: the amount of payload data in a single Ethernet frame.

Once the INIC has passed to the host an indication with
a NETBIOS length greater than the amount of data in the
packet it passed, it goes into a state where it is expecting an
MDL from the host; in this state, any further incoming data
is accumulated in DRAM on the INIC. Overflow of INIC
DRAM buffers will be avoided by using a receive window 
of (currently) 8760 bytes on the INIC. 

On receiving the indicated packet, the ATCP driver calls 
the receive handler registered by the TDI client for the 
connection, passing the actual size of the data in the packet 
from the INIC as "bytes indicated" and the NETBIOS length 
as "bytes available." 

In the "large data input" case, where "bytes available" 
exceeds the packet length, the TDI client then provides an 
MDL, associated with an IRP, which must be completed 
when this MDL is filled. (This IRP/MDL may come back 
either in the response to ATCP's call of the receive handler, 
or as an explicit TDI_RECEIVE request.) 

The ATCP driver builds a "receive request" from the MDL
information, and passes this to the INIC. This request
contains the TCP context identifier, size and offset
information, a scatter/gather list of physical addresses
corresponding to the MDL pages, a context field to allow the
ATCP driver to identify the request on completion, and
piggybacked window update information (this will be
discussed below).

Note: the ATCP driver must copy any remaining data (not
taken by the receive handler) from the header indicated by
the INIC to the start of the MDL, and must adjust the size
and offset information in the request passed to the INIC to
account for this.
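
A minimal sketch of how such a request might be assembled follows. It assumes the adjustment means advancing the offset and reducing the size by the number of leftover bytes copied into the MDL; the field and function names are hypothetical, and the MDL is modeled as a plain bytearray:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReceiveRequest:
    context_id: int      # TCP context identifier
    offset: int          # where the INIC should start placing data
    size: int            # how many bytes the INIC may still place
    sg_list: List[int]   # physical addresses of the MDL pages
    request_tag: int     # lets the driver identify the request on completion
    window_update: int   # piggybacked window update information


def build_receive_request(context_id: int, mdl: bytearray, sg_list: List[int],
                          leftover: bytes, request_tag: int,
                          window_update: int = 0) -> ReceiveRequest:
    """Copy header data the receive handler did not take, then adjust the request."""
    if len(leftover) > len(mdl):
        raise ValueError("leftover data does not fit in the MDL")
    mdl[:len(leftover)] = leftover  # remaining data goes to the start of the MDL
    return ReceiveRequest(context_id=context_id,
                          offset=len(leftover),
                          size=len(mdl) - len(leftover),
                          sg_list=list(sg_list),
                          request_tag=request_tag,
                          window_update=window_update)
```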

Once the INIC has been given the MDL, it will fill the 
given page(s) with incoming data up to the requested 
amount, and respond to the ATCP driver when this is done. 
Note that the INIC maintains its advertised receive window 
as the maximum (currently 8760 bytes) while filling the 
MDL, to maximize throughput from the client. 

On receiving the "receive request" response from the 
INIC, the ATCP driver completes the IRP associated with 
this MDL, to tell the TDI client that the data is available. 

At this point the cycle of events is complete, and the 
ATCP driver is now waiting for the next header indication. 

In the general case we do not have a higher-level protocol 
header to enable us to predict that more data is coming. The 
original idea was to accumulate segments until a given 
amount (e.g. 8K) was available, and then send a header to 
the host to cause it to provide an MDL in which to place the 
data. 

A problem with this approach is that the INIC would be 
required to close its advertised receive window as segments 
were accumulated, which would stall output from the send- 
ing client. To avoid this, we resorted (after some 
experimentation) to a subterfuge. On receiving an initial full 
segment, the INIC sends a header of 192 bytes: but also 
passes a fictitious "available length" of (currently) 8760 
bytes. 

As in the NETBIOS case, if "bytes available" exceeds 
"bytes indicated", the TDI client will provide an IRP with an 
MDL. The ATCP driver will pass the MDL to the INIC to be 
filled, as before. The INIC moves succeeding incoming 
segments into the MDL: and since the granting of the MDL 
may be regarded as a "promise" by the TDI client to accept 
the data, the INIC does not have to close its advertised 
receive window while filling the MDL. The INIC will reply 
to the ATCP driver when it has filled the MDL; the ATCP 
driver in turn completes the IRP to the TDI client. 
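
The subterfuge can be sketched as follows. This is an illustrative model only: the function name is hypothetical, and the two constants are the "currently" stated values from the text.

```python
HEADER_BYTES = 192           # size of the header sent up on the first full segment
FICTITIOUS_AVAILABLE = 8760  # "available length" reported with that header


def indicate_first_segment(segment: bytes):
    """Return (indicated data, claimed available length) for an initial segment."""
    header = segment[:HEADER_BYTES]
    return header, FICTITIOUS_AVAILABLE
```

Because the claimed available length exceeds the indicated length, the TDI client responds with an MDL, just as in the NETBIOS case.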

Of course, since there is no higher-level protocol to tell us
what the actual data length is, it is possible (for protocols
such as FTP and HTTP) to receive a FIN before the MDL is
filled. In that case, we do a "short completion", which causes
the 'information' field of the IRP corresponding to the MDL
to be set to the actual length received: less than the MDL
size. Fortunately, WINSOCK clients (and the AFD driver
through which they communicate with the TCP driver)
appear to handle this correctly.
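
A short completion can be modeled with a small sketch. The function name and the representation of segments as (payload, fin) pairs are assumptions made for illustration:

```python
def fill_mdl(mdl_size: int, segments):
    """Fill an MDL of mdl_size bytes from (payload, fin) pairs.

    Returns (data, information), where 'information' is the length actually
    received; it is less than mdl_size when a FIN arrives first.
    """
    buf = bytearray()
    for payload, fin in segments:
        room = mdl_size - len(buf)
        buf += payload[:room]
        if fin or len(buf) == mdl_size:
            break
    return bytes(buf), len(buf)
```

For example, an 8760-byte MDL that receives 1460 bytes and then a 500-byte segment carrying a FIN completes with an information value of 1960.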

If the INIC "owns" an MDL provided by the TDI client 
(sent by the ATCP as a receive request), it will treat data 
placed in this as being "accepted by the client." It may 
therefore ACK incoming data as it is filling the pages, and 
will keep its advertised receive window fully open. 

However, for small requests, there is no MDL returned by 
the TDI client: it absorbs all of the data directly in the 
receive callback function. In this case we need to update the 
INIC's view of data which has been accepted, so that it can 
update its receive window. In order to be able to do this, the 
ATCP driver accumulates a count of data which has been 
accepted by the TDI client receive callback function for a 
connection. 

From the INIC's point of view, though, segments sent up 
to the ATCP driver are just "thrown over the wall"; there is 
no explicit reply path. We therefore piggyback the update on 
requests sent out to the INIC. Whenever the ATCP driver has 
outgoing data for that connection, it places this count in a 
field in the send request (and then clears the counter.) 
Receive requests (passing a receive MDL to the INIC) also 
are used to transport window update information in the same 
way. 

Note that there is also a message path whereby the ATCP 
driver explicitly sends an update of this "bytes consumed" 
information when it exceeds a preset threshold, to allow for 
scenarios in which the data stream is entirely one-way. 
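
The accounting described above can be sketched as a small counter. The class name, method names and threshold value are hypothetical; the text says only that a preset threshold exists.

```python
class ConsumedCounter:
    """Per-connection count of bytes accepted by the TDI receive callback."""

    def __init__(self, threshold: int):
        self.threshold = threshold  # preset threshold (value is an assumption)
        self.count = 0

    def accepted(self, nbytes: int):
        """Record accepted bytes; return an explicit update if over threshold."""
        self.count += nbytes
        if self.count > self.threshold:
            return self.take()      # send an explicit "bytes consumed" message
        return None

    def take(self) -> int:
        """Read and clear the count, e.g. to piggyback on a send or receive request."""
        n, self.count = self.count, 0
        return n
```

A send or receive request would call take() and place the result in its window update field; accepted() returns a value only when the one-way threshold path is triggered.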

The fast-path transmit or output data flow is considerably 
simpler. In this case the TDI client provides an MDL to the
ATCP driver along with an IRP to be completed when the 
data is sent. The ATCP driver then gives a request 
(corresponding to the MDL) to the INIC. This request 
contains the TCP context identifier, size and offset 
information, a scatter/gather list of physical addresses cor- 
responding to the MDL pages, a context field to allow the 
ATCP driver to identify the request on completion, and 
piggybacked window update information. 

The INIC will copy the data from the given physical 
location(s) as it sends the corresponding network frames
onto the network. When all of the data is sent, the INIC will 
notify the host of the completion, and the ATCP driver will 
complete the IRP. 

Note that there may be multiple output requests pending 
at any given time. SMB allows multiple SMB requests to be 
simultaneously outstanding, and other protocols (e.g. FTP)
often use double-buffering. 
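
Since several output requests may be pending at once, the context field in each request is what lets the driver find the right IRP on completion. A minimal sketch, with hypothetical names and IRPs modeled as opaque objects:

```python
class SendQueue:
    """Tracks outstanding fast-path send requests by their context tag."""

    def __init__(self):
        self._next_tag = 0
        self._pending = {}

    def submit(self, irp, mdl_len: int) -> dict:
        """Register an IRP and return the request that would go to the INIC."""
        tag = self._next_tag
        self._next_tag += 1
        self._pending[tag] = irp
        return {"tag": tag, "size": mdl_len}

    def on_completion(self, tag: int):
        """INIC reports completion: return the IRP to complete."""
        return self._pending.pop(tag)

    def outstanding(self) -> int:
        return len(self._pending)
```

Looking the IRP up by tag works regardless of the order in which completions arrive; whether the INIC ever completes requests out of order is not stated in the text.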

For data for which there is no connection context being
maintained on the INIC, the host performs the TCP, IP, and
Ethernet processing (slow-path). To accomplish this, ATCP
ports the FreeBSD protocol stack. In this mode, the INIC is
operating as a "dumb NIC"; the packets which pass over the
NDIS interface contain MAC-layer frames.

The memory buffers (MBUFs) in the incoming direction 
are in fact managing NDIS-allocated packets. In the outgo- 
ing direction, we have protocol-allocated MBUFs in which 
to assemble the data and headers. The MFREE macro is 
cognizant of the various types of MBUFs, and is able to 
handle each type. 

ATCP retains a modified socket structure for each connection, containing the socket buffer fields expected by the FreeBSD code. The TCP calls that operate on socket buffers (adding/removing MBUFs to and from queues, indicating acknowledged and received data, etc.) remain (as far as possible) unchanged in name and parameters from the FreeBSD base, though most of the actual code needed to implement them is rewritten. These are functions in kern/uipc_socket2.c; the corresponding ATCP code is mostly in atksocket.c.

The upper socket layer (kern/uipc_socket.c), where the overlying OS moves data in and out of socket buffers, must be entirely re-implemented to work in TDI terms. Thus, instead of sosend( ), there is a function that maps the MDL provided in a TDI_SEND call with an MBUF and queues it on to the socket 'send' buffer. Instead of soreceive( ) there is a handler that calls the TDI client receive callback function, and also copies data from socket receive buffer MBUFs into any MDL provided by the TDI client (either [illegible] or as a separate [illegible]). [illegible]

Note that there is a semantic difference between TDI_SEND and a write( ) on a BSD socket. The latter may complete back to its caller as soon as the data has been copied into the socket buffer. The completion of a TDI_SEND, however, implies that the data has actually been sent on the connection. Thus we need to keep the TDI_SEND IRPs (and associated MDLs) in a queue on the socket until the TCP code indicates that the data from them has been ACK'd.

To pass a context from the ATCP to the INIC for processing via the fast-path, a synchronization problem must be addressed. The ATCP driver makes the decision when a given connection should be passed to the INIC. The criterion is basically that the connection is on an interesting protocol port, and is currently quiescent: i.e. there is no currently outstanding input or output data which would cause the TCP sequence fields to change as it is processed.

To initiate a handout to the INIC, the ATCP driver builds and sends a command identifying this connection to the INIC. Once this is sent, ATCP pends and queues any new TDI_SEND requests; they will be acted on once fast-path processing is fully established.

The problem arises with incoming slow-path data. If we attempt to do the context-pass in a single command handshake, there is a window during which the ATCP driver has sent the context command, but the INIC has not yet acquired or has not yet completed setting up the context. During this time, slow-path input data frames could arrive and be fed into the slow-path ATCP processing code. Should that happen, the context information which the ATCP driver passed to the INIC would no longer be correct. We could simply abort the outward pass of the context in this event, but it turns out that this scenario is quite common. So it seems better to have a reliable handshake, which is accomplished with a two-exchange handshake.

The initial command from ATCP to INIC expresses an "intention" to hand out the context. It carries a context number; context numbers are allocated by the ATCP driver, which keeps a per-INIC table of free and in-use context numbers. It also includes the source and destination IP addresses and ports, which will allow the INIC to establish a "provisional" context. Once it has this "provisional" context in place, the INIC will not send any more slow-path input frames to the host for that src/dest IP/port combination, instead queuing them, if any are received.

Receipt of the response to this initial command does not suffice to provide a reliable interlock, however. Command responses and data frames follow entirely different paths from INIC to the ATCP driver. It is possible that when the response to the initial command is received, there are still slow-path frames in a queue waiting to be delivered. Therefore, once the INIC has established its provisional context (and is now blocking and queuing any further input), it sends a "NULL" interlock frame on the regular frame input path. This frame identifies the CCB context and signals that no further slow-path frames will follow for that context. Since this frame travels on the same pathway as data frames, we know when we receive it that it signifies the end of any possible slow-path data.

Once the "NULL" frame has been seen (and any preceding slow-path data has been fully processed), we know that [illegible] is a stable and quiescent state, [illegible] the "commit" [illegible]

Note that there are situations in which the ATCP driver decides, after having sent the "intention" command, that the context is not to be passed after all. (For example, the TDI client may issue a TDI_DISCONNECT or a slow-path frame arriving before the NULL interlock frame might contain a FIN.) So we must allow for the possibility that the second command may be a "flush", which should cause the INIC to deallocate and clear up its "provisional" context.

The ATCP driver must place some restrictions on exactly when a "flush" command may be sent, to avoid unnecessary complexity in the INIC state machine. Specifically, it must not send a "flush" command when there is an outstanding first- or second-half migration command. If a flush situation arises while a migration command is outstanding, the condition is noted in host connection flags, and the actual flush command is sent only when the NULL frame is received (in the first-half case) or the command response is received (in the second-half case).

The converse situation of passing the context from the INIC to the ATCP may be initiated either by the ATCP driver or by the INIC. The machinery for flushing the context from the INIC to the ATCP is similar regardless of which system initiated the transfer. If the ATCP driver wishes to cause context to be flushed from INIC to host, it sends a "flush" message to the INIC specifying the context number to be flushed. Once the INIC receives this it proceeds with the same steps as for the case where the flush is initiated by the INIC itself.

The INIC sends an error response to any current outstanding receive request it is working on (corresponding to an MDL into which data is being placed). Along with this response, it sends a "resid" field to reflect the amount of data that has not been placed in the MDL buffers at the time of the flush. Likewise the INIC sends an error response for any outstanding send requests. For each of these, it will send a "resid" field to indicate how much of the request's data has not been ACK'd. The INIC also DMAs the CCB for the context back to the host. Note that part of the information provided with a context is the address of the CCB in the host. The INIC sends a "flush" indication to the host, via the regular input path as a special type of frame, identifying the context which is being flushed. Sending this indication via the regular input path ensures that it will arrive before any following slow-path frames.

At this point, the INIC is no longer doing fast-path processing. It discards its CCB context for this connection and any further incoming frames for the connection will simply be sent to the host as raw frames for the slow input path.

As soon as the ATCP driver detects that a flush is in progress on a connection, it sets a state flag on its connection
context to indicate this fact. The ATCP driver may become alerted about a flush in several ways: it might be an explicit host-initiated flush, or it may see either the flush frame or an error on a send or receive request. The order in which these are received may vary because, as we noted earlier, the receive frame and command response paths are unrelated.

The ATCP driver will not be able to complete the cleanup operations needed to resume normal slow path processing until all the necessary pieces of information are received: the "flush frame" indicating that the INIC has DMA'd back the CCB, and the error completions of all outstanding send and receive requests.

Slow-path frames could arrive during this time: this is why the driver must set the "flushing" state flag on the connection. The effect of this is to change the behavior of tcp_input( ). This runs as a function call in the context of ip_input( ), and normally returns only when incoming frames have been processed as far as possible (queued on the socket receive buffer or out-of-sequence reassembly queue). However, if there is a flush pending and we have not yet completed resynchronization, we cannot do TCP processing and must instead queue input frames for TCP on a "holding queue" for the connection, to be picked up later when context flush is complete and normal slow path processing resumes. (This is why we want to send the "flush" indication via the normal input path: so that we can ensure it is seen before any following frames of slow-path input.)

When the ATCP driver has received the "flush frame" plus error responses for all outstanding requests, it has all the information needed to complete its cleanup. This involves completing any IRPs corresponding to requests which have entirely completed, adjusting fields in partially-completed requests so that send and receive of slow path data will resume at the right point in the byte streams, and propagating any timer expiration states from the INIC to the BSD code.

Once all this cleanup is complete, the ATCP driver will loop pulling any "pended" TCP input frames off the "pending queue" mentioned above and feeding them into the normal TCP input processing. After all input frames on this queue have been cleared off, the "flushing" flag can be cleared for the connection, and the host is back to normal slow-path processing.

A portion of the ATCP driver is either derived or directly taken from the FreeBSD TCP/IP protocol stack. The following paragraphs discuss the issues associated with porting this code, the FreeBSD code itself, and the modifications required for it to suit our needs. Note, however, that details of the higher, TCP-level part of the port are postponed until later, since this needs some groundwork from the discussion of the NT TDI interface.

FreeBSD TCP/IP (current version referred to as Net/3) is a general purpose TCP/IP driver. It contains code to handle a variety of interface types and many different kinds of protocols. To meet this requirement the code is often written in a sometimes confusing, convoluted manner. General-purpose structures are overlaid with other interface-specific structures so that different interface types can coexist using the same general-purpose code. For our purposes much of this complexity is unnecessary since we are initially only supporting several specific protocols. It is therefore tempting to modify the code and data structures in an effort to make it more readable, and perhaps a bit more efficient. There are, however, some problems with doing this.

For this reason we have initially kept the data structures and code as close to the original FreeBSD implementation as possible. The code has, however, been modified for several reasons. First, as required for NT interaction: we can't expect to simply "drop-in" the FreeBSD code as is. The interface of this code to the NT system requires some significant code modifications. This mostly occurs at the topmost and bottommost portions of the protocol stack, as well as the "ioctl" sections of the code. Modifications for SMP are also necessary. Further, unnecessary code has been removed.

FreeBSD TCP/IP protocol stack makes use of many Unix system services. These include bcopy to copy memory, malloc to allocate memory, timestamp functions, etc. These will not be itemized in detail since the conversion to the corresponding NT calls is a fairly trivial and mechanical operation.

Under FreeBSD, network buffers are mapped using mbufs. Under NT network buffers are mapped using a combination of packet descriptors and buffer descriptors (the buffer descriptors are really MDLs). There are a couple of problems with the NT method. First it does not provide the necessary fields which allow us to easily strip off protocol headers. Second, converting all of the FreeBSD protocol code to speak in terms of buffer descriptors is an unnecessary amount of overhead. Instead, in our port we allocate our own mbuf structures and remap the NT packets as shown in FIG. 5.

FIG. 5 shows FreeBSD mbufs 140 and 142 including data pointers 150 and 152, which point to the current location of the data, data length fields and flags. In addition each mbuf 155 and 157 will point to a packet descriptor 160 which is associated with the data being mapped. Once an NT packet is mapped, our transport driver should never have to refer to the packet or buffer descriptors 162 and 164 for any information except when we are finished and are preparing to return the packet.

There are a couple of things to note here. The INIC has been designed such that a packet header should never be split across multiple buffers. Thus, we should never require the equivalent of the "m_pullup" routine included in Unix. Also note that there are circumstances in which we will be accepting data that will also be accepted by the Microsoft TCP/IP. One such example of this is ARP frames. We also build our own ARP cache by looking at ARP replies as they come off the network. Under these circumstances, it is important that we do not modify the data, or the packet and buffer descriptors. We will discuss this further below.

Also note that we allocate a pool of mbuf headers at ATCP initialization time. It is important to remember that unlike other NICs, we do not simply drop data if we run out of the system resources required to manage/map the data. The reason for this is that we will be receiving data from the card that has already been acknowledged by TCP. Because of this it is important that we never run out of mbuf headers. To solve this problem we statically allocate mbuf headers for the maximum number of buffers that we will ever allow to be outstanding. By doing so, the card will run out of buffers in which to put the data before we will run out of mbufs, and as a result, the card will be forced to drop data at the link layer instead of us dropping it at the transport layer.

We also use a pool of actual mbufs (not just headers). These mbufs are needed in order to build output packets for the slow-path data path, as well as other miscellaneous purposes such as for building ARP requests. We allocate a pool of these at initialization time and add to this pool dynamically as needed. Unlike the mbuf headers described above, which are used to map acknowledged TCP data coming from the card, the full mbufs contain data that can be dropped if we cannot get an mbuf.

The following paragraphs describe the lower-level sections of the FreeBSD TCP/IP port, up to and including the
IP level. These sections include Interface Initialization, ARP, Route, IP, and ICMP. Discussions of modifications to the TCP layer are postponed, since they need some grounding in the NT TDI interface described below.

There are a variety of structures, which represent a single interface in FreeBSD. These structures include ifnet, arpcom, ifaddr, in_ifaddr, sockaddr, sockaddr_in, and sockaddr_dl.

FIG. 6 shows the relationship between some of these structures. In this example we show a single interface with a MAC address (sockaddr_dl 170) of 00 60 97 DB 9B A6 configured with an IP address (sockaddr_in 172) of 192.100.1.2. As illustrated above, the in_ifaddr 175 is actually an ifaddr 177 structure with some extra fields tacked on to the end. Thus the ifaddr structure is used to represent both a MAC address and an IP address. Similarly the sockaddr structure is recast as a sockaddr_dl or a sockaddr_in depending on its address type. An interface can be configured to multiple IP addresses by simply chaining in_ifaddr structures after the in_ifaddr structure shown above. As mentioned in the porting philosophy section, many of the above structures could likely be collapsed into fewer structures. In order to avoid making unnecessary modifications to FreeBSD, for the time being we have left these structures mostly unchanged. We have, however, eliminated the fields from the structure that will never be used. These structure modifications are discussed below.

We also show in FIG. 6 a structure called IFACE 180. This is a structure that we define, in proto.h. It contains the arpcom 182 structure, which in turn contains the ifnet 185 structure. It also contains fields that enable us to blend our FreeBSD implementation with NT NDIS requirements. One such example is the NDIS binding handle used to call down to NDIS with requests (such as send).

FreeBSD initializes the above structures in two phases. First when a network interface is found, the ifnet, arpcom, and first ifaddr structures are initialized first by the network layer driver, and then via a call to the if_attach routine. The subsequent in_ifaddr structure(s) are initialized when a user dynamically configures the interface. This occurs in the in_ioctl and the in_ifinit routines.

Interface initialization in the ATCP driver changes considerably from BSD, because in NT, many parameters are obtained from the registry, rather than being set by IOCTL( ) calls. Initialization still occurs in two phases, but the details are different:

ATKIfInit is called from the DriverEntry function when the ATCP driver is loaded. It scans the registry, looking for all interfaces bound to TCP/IP. For each one, it allocates an IFACE structure, and does further registry scanning to obtain parameters for this interface. Once these are obtained, it calls if_attach( ), which allocates the ifaddr structure for the interface, and links it on to the BSD interface list. Then, for each IP address specified in the registry for this interface (there may be more than one), it builds an ifaliasreq structure containing the address and its netmask, and calls in_control with the SIOCAIFADDR command to allocate and initialize the in_ifaddr and sockaddr_in structures; this has the side effect of creating the routes for the interface. (Note however that if the interface is specified in the registry to use DHCP there are no IP addresses at this point; in that case a flag is set in the IFACE to indicate that DHCP is to be used.) Finally, if a default gateway is specified for the interface, a call is made to ATKAddDefRoute (in file atkroute.c) to add this to the route tables.

Note that so far, everything has been done from information in the registry; we do not yet have any contact with physical hardware. That occurs in the second phase, when NDIS calls our ATKBindAdaptor function to set up the connection to the actual NDIS-level adaptor:

ATKBindAdaptor locates the IFACE for the given adaptor name, and does a query request to NDIS to obtain the MAC address for the interface; this is saved in the arpcom struct. It then does a number of other queries for characteristics of the interface and stores the results in the IFACE. Next, it passes down all current IP addresses using an Alacritech-specific OID: this is needed because the INIC driver at the lower level needs to know about IP addresses in order to [illegible] packets correctly to either the ATCP driver or the [illegible]. (See section 9.2.1.) Finally it marks the interface up, and broadcasts a gratuitous ARP request to notify others of our MAC/IP address and to detect duplicate IP addresses on the net.

Microsoft NT TCP/IP code supports the Dynamic Host Configuration Protocol (DHCP), whereby one can arrange for an interface to not be assigned a static IP address but rather, to search for a DHCP server to assign one for it to use. In this case ATKIfInit does not find an address in the registry for the interface: it will arrive later in a DEVICE_IO_CONTROL on the IP device object. Our filter driver attaches to and monitors the IP device object as well as the TCP one. We catch the completion of the IOCTL_SET_DHCPADDR request in the ATKDHCPDone function (in file atkdhcp.c); there, we decode the parameters and locate the interface. Then we call the BSD in_control function to set the IP address and netmask, and replicate the later part of the ATKBindAdaptor processing (which can't be done there in the case of a DHCP interface since we don't yet have an IP address) to complete the process of making the interface active.

The DHCP protocol provides a time-limited "lease" of an IP address: this implies that DHCP IP addresses can go away, as well as arrive. If we detect that the DHCP IO_CONTROL is a deletion, we must mark the interface down, and delete any routes using it. Additionally, we need to flush any fast-path connections using this interface back to the host; this is done by the ATKIfRouteFlush( ) function (in atkfastpath.c).

We port the FreeBSD ARP code to NT mostly as-is. For some reason, the FreeBSD ARP code is located in a file called if_ether.c. While we do not change the functionality of this file, we rename it to a more logical arp.c. The main structures used by ARP are the llinfo_arp structure and the rtentry structure (actually part of route). These structures do not require major modifications. The functions that require modification are defined here.

An in_arpinput function is called to process an incoming ARP frame. An ARP frame can either be an ARP request or an ARP reply. ARP requests are broadcast, so we will see every ARP request on the network, while ARP replies are directed so we should only see ARP replies that are sent to us. This introduces several scenarios for an incoming ARP frame.

First, an ARP request may be trying to resolve our IP address. Under conventional circumstances, ARP would reply to this ARP request with an ARP reply containing our MAC address. Since ARP requests will also be passed up to the Microsoft TCP/IP driver, we need not reply. Note however, that FreeBSD also creates or updates an ARP cache entry with the information derived from the ARP request. It does this in anticipation of the fact that any host that wishes to know our MAC address is likely to wish to talk to us soon. Since we need to know his MAC address in order to talk back, we add the ARP information now rather than issuing our own ARP request later.
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Second, an ARP request may be trying to resolve someone contain the same information for both NT and FreeB SD, 

else's IP address. As mentioned above, since ARP requests and since the key used to search for an entry in the routing 

are broadcast, we see every one on the network. When we table will be the same for each (the destination IP address), 

receive an ARP request of this type, we simply check to see we port the routing table software to NT without any major 

if we have an entry for the host that sent the request in our 5 changes. 

ARP cache. If we do, we check to see if we still have the The software which implements the route table (via the 

correct MAC address associated with that host. If it is PATRICIA algorithm) is located in the FreeBSD file, radix.c. 

incorrect, we update our ARP cache entry. Note that we do This file is ported directly to the ATCP driver with insig- 

not create a new ARP cache entry in this case. nificant changes. 

Third, in the case of an ARP reply, we add the new ARP to Routes can be added or deleted in a number of different 

entry to our ARP cache. Having resolved the address, we ways. The kernel adds or deletes routes when the state of an 

check to see if there is any transmit requests pending for the interface changes or when an ICMP redirect is received, 

resolve IP address, and if so, transmit them. User space programs such as the RIP daemon, or the route 

Given the above three possibilities, the only major change command also modify the route table, 

to the in_arpinput code is that we remove the code which is For kernel-based route changes, the changes can be made 

generates an ARP reply for ARP requests that are meant for by a direct call to the routing software. The FreeBSD 

our interface. software that is responsible for the modification of route 

Arpintr is the FreeBSD code that delivers an incoming table entries is found in route.c. The primary routine for all 

ARP frame to in_arpinput. We call in_arpinput directly route table changes is called rtrequest( ). It takes as its 

from our ProtocolReceiveDPC routine (discussed in the 20 arguments the request type (ADD, RESOLVE, DELETE), 

NDIS section below) so this function is not needed. the destination IP address for the route, the gateway for the' 

Arpwhohas is a single fine function that serves only as a route, the netmask for the route, the flags for the route, and 

wrapper around arprequest. We remove it and replace all a pointer to the route structure (struct rtentry) in which we 

calls to it with direct calls to arprequest. place the added or resolved route. Other routines in the 

Arprequest simply allocates a mbuf, fills it in with an ARP 25 route.c file include rtinit( ), which is called during interface 

header, and then passes it down to the ethemet output routine initialization time to add a static route to the network, 

to be transmitted. For us, the code remains essentially the rtredirect, which is called by ICMP when we receive a ICMP 

same except for the obvious changes related to how we redirect, and an assortment of support routines used for the 

allocate a network buffer, and how we send the filled in modification of route table entries. All of these routines 

request. 30 found in route.c are ported with no major modifications. 

Arp_ifinit is called when an interface is initialized to For user-space-based changes, we will have to be a bit 

broadcast a gratuitous ARP request (described in the inter- more clever. In FreeBSD, route changes are sent down to the 

face initialization section) and to set some ARP related fields kernel from user-space applications via a special route 

in the ifaddr structure for the interface. We simply move this socket. This code is found in the FreeBSD file, rtsock.c. 

functionality into the interface initialization code and 35 Obviously this will not work for our ATCP driver. Instead 

remove this function. the filter driver portion of our driver will intercept route 

Arptimer is a timer-based function that is called every 5 changes destined for the Microsoft TCP driver and will 

minutes to walk through the ARP table looking for entries apply those modifications to our own route table via the 

that have timed out. Although the time-out period for rtrequest routine described above. In order to do this, it will 

FreeBSD is 20 minutes, RFC 826 does not specify any timer 40 have to do some format translation to put the data into the 

requirements with regard to ARP so we can modify this format (sockaddr_in) expected by the rtrequest routine.

value or delete the timer altogether to suit our needs. Either Obviously, none of the code from rtsock.c will be ported to

way the function doesn't require any major changes. the ATCP driver. This same procedure will be used to 

Other functions in if_ether.c do not require any major intercept and process explicit ARP cache modifications.

changes. 45 The functions which perform these updates are 

On first thought, it might seem that we have no need for ATKSetRouteCompletion( ) and ATKSetArpCompletion( ), 

routing support since our ATCP driver will only receive IP in the file atkinfo.c.

datagrams whose destination IP address matches that of one In FreeBSD, a route table is consulted in ip_output when 

of our own interfaces. Therefore, we do not "route" from one an IP datagram is being sent. In order to avoid a complete 

interface to another. Instead, the MICROSOFT TCP/IP 50 route table search for every outgoing datagram, the route is 

driver provides that service. We do, however, need to stored into the in_pcb for the connection. For subsequent 

maintain an up-to-date routing table so that we know a) calls to ip_output, the route entry is then simply checked to 

whether an outgoing connection belongs to one of our ensure validity. While we will keep this basic operation as is, 

interfaces, b) to which interface it belongs, and c) what the we require a slight modification to allow us to coexist with 
first-hop IP address (gateway) is if the destination is not on 55 the Microsoft TCP driver. When an active connection is 

the local network. being set up, our filter driver has to determine whether the 

We discuss four aspects on the subject of routing in this connection is going to be handled by one of the INIC 

section. They are as follows: 1) The mechanics of how interfaces. To do this, we consult the route table from the 

routing information is stored, 2) The manner in which routes filter driver portion of our driver. This is done via a call to 
are added or deleted from the route table, 3) When and how 60 the rtalloc1 function (found in route.c). If a valid route table

route information is retrieved from the route table, 4) entry is found, then we take control of the connection and set

Notification of route table changes to interested parties. a pointer to the rtentry structure returned by rtalloc1 in our

In FreeBSD, the route table is maintained using an in_pcb structure.

algorithm known as PATRICIA (Practical Algorithm To When a route table entry changes, there may be connections
Retrieve Information Coded in Alphanumeric). This is a 65 that have pointers to a stale route table entry. These

complicated algorithm that is a bit costly to set up, but is connections need to be notified of the new route. FreeBSD 

very efficient to reference. Since the routing table should solves this by checking the validity of a route entry during 
every call to ip_output. If the entry is no longer valid, its It can do this because it no longer requires the original IP
reference to the stale route table entry is removed, and an header. This is an absolute no-no with the NDIS 4.0 method
attempt is made to allocate a new route to the destination. of handling network packets. The NT DDK explicitly states
For the slow-path, this works fine. Unfortunately, since our that we must not modify packets given to us by NDIS. This
IP processing is handled by the INIC for the fast-path, this is not the only place in which the FreeBSD code modifies the
sanity check method will not be sufficient. Instead, we will contents of a network buffer. It also does this when performing
need to perform a review of all of our fast path connections endian conversions. At the moment we leave this
during every route table modification. If the route table code as is and violate the DDK rules. We can do this because
change affects our connection, we flush the connection off we ensure that no other transport driver looks at these
the INIC so that we revert to slow-path processing using the 10 frames. If this changes, we can modify this code substantially
TCP protocol code. This picks up the new route, and by moving the IP reassembly fields into the mbuf
uses this to build a new template when the connection is later header. 

handed out to the INIC again. The function which performs Regarding IP output, only two modifications are made.

this scan-and-flush is ATKRouteFlush( ), in file atkfastpath.c. The first is that since, for the moment, we are not dealing

15 with IP options, there is no need for the code that inserts the

Like the ARP code above, we need to process certain IP options into the IP header. Second, we may discover that

types of incoming ICMP frames. Of the 10 possible ICMP it is impossible for us to ever receive an output request that

message types, there are only three that we need to support. requires fragmentation. Since TCP performs Maximum Segment

These are ICMP_REDIRECT, ICMP_UNREACH, Size negotiation, we should theoretically never attempt

and ICMP_SOURCEQUENCH. Any FreeBSD code to deal 20 to send a TCP segment larger than the MTU.

with other types of ICMP traffic will be removed. Instead, An NDIS Protocol Driver portion of the ATCP driver is 

we simply return NDIS_STATUS_NOT_ACCEPTED for defined by the set of routines registered with NDIS via a call 

all but the above ICMP frame types. The following para- to NdisRegisterProtocol. These routines are limited to those 

graphs describe how we handle these ICMP frames. that are called (indirectly) by the INIC miniport driver 

Under FreeBSD, an ICMP_REDIRECT causes two 25 beneath us. For example, we register a ProtocolReceive- 

things to occur. First, it causes the route table to be updated Packet routine so that when the INIC driver calls Ndis- 

with the route given in the redirect. Second, it results in a call MIndicateReceivePacket it will result in a call from NDIS to 

back to TCP to cause TCP to flush the route entry attached our driver. 

to its associated in_pcb structures. By doing this, it forces The NDIS protocol driver initialization occurs in two 

ip_output to search for a new route. As mentioned in the 30 phases. The first phase occurs when the ATCP DriverEntry 

Route section above, we also require a call to a routine routine calls ATKProtoSetup. The ATKProtoSetup routine 

which reviews all of the TCP fast-path connections, and allocates resources, registers protocol and locates and initializes

flushes any using the affected route. bound NICs. We attempt to allocate many of the

In both FreeBSD and Microsoft TCP, the ICMP_ required resources as soon as possible so that we are more
UNREACH results in no more than a simple statistic update. 35 likely to get the memory we want. This mostly applies to

We do the same. allocating and initializing our mbuf and mbuf header pools.

A source quench is sent to cause a TCP sender to close its We call NdisRegisterProtocol to register our set of protocol

congestion window to a single segment, thereby putting the driver routines. The location and initialization of bound NICs

sender into slow-start mode. We keep the FreeBSD code is done by ATKIQnit, as described above.
as-is for slow-path connections. For fast path connections 40 After the underlying INIC devices have completed their 

we must flush the context back to the host, since we are not, initialization, NDIS calls our driver's ATKBindAdapter 

at least at the moment, handling congestion and slow-start function for each underlying device. This completes the 

on the INIC. interface initialization. 

The FreeBSD IP code requires few modifications when Receive is handled by the protocol driver routine ATKRe- 
porting to the ATCP driver, which are described in the 45 ceivePacket. Before we describe this routine, consider each 

paragraphs below. possible receive type and how it will be handled. As shown 

During initialization time, ip_init is called to initialize the in FIG. 7, the INIC miniport driver 200 is bound to the ATCP 

array of protosw structures. These structures contain all the transport driver 202 as well as the generic Microsoft TCP 

information needed by IP to be able to pass incoming data driver 205, and optionally others. The ATCP driver 202 is 
to the correct protocol above it. We strip the protosw array 50 bound exclusively to INIC devices, while the generic TCP 

to exclude unnecessary protocols. driver 205 is bound to the INIC as well as other conventional 

Changes made to IP input (function ip_intr( )) are listed NICs, as indicated by its connection to a generic miniport

below. driver 208 as well as the INIC miniport driver 200.

First, since we only handle datagrams for which we are By binding the drivers in this fashion, we can choose to
the final destination, we are never required to forward an IP 55 direct incoming network data to our own ATCP transport 

datagram. All references to IP forwarding, and the driver, the Microsoft TCP driver, or both. We do this by 

ip_forward function itself, are removed. IP options supported playing with the ethernet "type" field as follows. To NDIS

by FreeBSD at this time include record route, strict and the transport drivers above it, the INIC is registered as

and loose source and record route, and timestamp. For the a normal ethernet card. When the generic TCP/IP transport
timestamp option, FreeBSD only logs the current time into 60 driver receives a packet from the INIC driver, it will expect

the IP header before it is forwarded. Since we will not be the data to start with an ethernet header, and consequently

forwarding IP datagrams, this seems to be of little use to us. expects the protocol type field to be in byte offset 12. If

While FreeBSD supports the remaining options, NT essentially Microsoft TCP finds that the protocol type field is not equal

does nothing useful with them. to either IP, or ARP, it will not accept the packet. So to
There is a small problem with the FreeBSD IP reassembly 65 deliver an incoming packet to our driver, we simply map the 

code. The reassembly code reuses the IP header portion of data such that byte 12 contains a non-recognized ethernet

the IP datagram to contain IP reassembly queue information. type field. Note that we choose a value that is greater than
1500 bytes so that the transport drivers do not confuse it with the ethernet type field is unique to us (no one else will want

an 802.3 frame. We also choose a value that will not be it). Obviously this only works with data that is only deliv- 

accepted by other transport driver such as Appletalk or IPX. ered to our ATCP driver. For ARP and ICMP frames we 

Similarly, if we want to direct the data to Microsoft TCP, we instead copy the data out of the packet into our own buffer 

can then simply leave the ethernet type field set to IP (or and return the packet to NDIS directly. While this is less

ARP). Note that since we will also see these frames we can efficient than keeping the data and returning it later, ARP and

choose to accept or not-accept them as necessary. ICMP traffic should be small enough, and infrequent

Incoming packets delivered to ATCP only (not accepted enough, that it doesn't matter.

by MSTCP) include TCP, TTCP or SPX packets destined for The DDK specifies that when a protocol driver chooses to 

one of our IP addresses. This includes both slow-path frames 10 keep a packet, it should return a value of 1 (or more) to NDIS 

and fast-path frames. In the slow-path case, the TCP frames in its ProtocolReceivePacket routine. The packet is then later 

are given in their entirety (headers included). In the fast-path returned to NDIS via the call to NdisReturnPackets. This can

case, the ATKReceivePacket is given a header buffer that only happen after the ProtocolReceivePacket has returned

contains status information and data with no headers (except control to NDIS. This requires that the call to NdisReturnPackets

those above TCP). 15 must occur in a different execution context. We

Incoming packets delivered to Microsoft TCP only (not accomplish this by scheduling a DPC, or alternatively 

accepted by ATCP) are packets according to protocol not scheduling a system thread, or scheduling a kernel thread of 

suitable for the fast-path (non-TCP, TTCP or SPX packets) our own. A DPC requires a queue of pending receive buffers 

or packets that are not destined for one of our interfaces on which to place and fetch receive packets.

(packets that will be routed). If in the above example, there 20 After a receive packet is dequeued by the DPC it is then 

is an IP address 144.48.252.4 associated with a 3com either passed to TCP directly for fast-path processing or it 

interface, and we receive a TCP connect with a destination is sent through the FreeBSD path for slow-path processing.

IP address of 144.48.252.4, we will actually want to send Note that in the case of slow-path processing we may be 

that request up to the ATCP driver so that we create a working on data that needs to be returned to NDIS (for 

fast-path connection for it. This means that we need to know 25 example TCP data) or we may be working on our own copy

every IP address in the system and filter frames based on the of the data (ARP and ICMP). When we finish with the data

destination IP address in a given TCP datagram. This is done we will need to figure out whether or not to return the data

in the INIC miniport driver. Since the ATCP driver learns of to NDIS. This will be done via fields in the mbuf

dynamic IP address changes in the system, we notify the header used to map the data. When the mfreem routine is

INIC miniport of all the IP addresses in the system. 30 called to free a chain of mbufs, the fields in the mbuf will be

Incoming packets delivered to both ATCP and Microsoft checked and, if required, the packet descriptor pointed to by

TCP include ARP frames and ICMP frames. There are the mbuf is returned to NDIS.

several circumstances in which the INIC will need to As noted in the section on mbufs above, we map incoming 

extra information about a receive packet to the data to mbufs so that our FreeBSD port requires fewer

ATCP driver. One such example is a fast path receive in 35 modifications. Depending on the type of data received this

which the ATCP driver needs to be notified of how much mapping will appear differently. 

data the card has buffered. To accomplish this, the first (and FIG. 9A shows incoming data packet 245 for a TCP 

sometimes only) buffer in a received packet will actually be fast-path connection. In this example, the TCP data 250 is

a INIC header buffer. The header buffer contains status fully contained in a header buffer 255. The header buffer is
information about the receive packet, and may or may not 40 mapped by the mbuf 257 and sent upstream for fast-path

contain network data as well. The ATCP driver recognizes a TCP processing. In this case it is required that the header

header buffer by mapping it to an ethernet frame and buffer be mapped and sent upstream because the fast-path 

inspecting the type field found in byte 12. We indicate all TCP code needs information contained in the header buffer 

TCP frames destined for us in this fashion, while frames that in order to perform the processing. When the mbuf in this
are destined for both our driver and the Microsoft TCP driver example is freed, the mfreem routine will determine that the

are indicated without a header buffer. mbuf maps a packet that is owned by NDIS and will then

FIG. 8A shows an example of an incoming TCP packet, free the mbuf header only and call NdisReturnPackets to

whereas FIG. 8B shows an example of an incoming ARP free the data.

frame, after processing by the INIC. In FIG. 9B, we show incoming data packet 260 for a TCP
NDIS has been designed such that all packets indicated so slow-path connection. In this example the mbuf 264 points 

via NdisMIndicateReceivePacket by an underlying miniport to the start of the TCP data 266 directly instead of to a header 

are delivered to the ProtocolReceivePacket routine for all buffer 268. Since a data buffer 270 will be sent up for 

protocol drivers bound to it. These protocol drivers can slow-path FreeBSD processing, we cannot have the mbuf 

choose to accept or not accept the data. They can either pointing to the header buffer (FreeBSD would get awfully 
accept the data by copying the data out of the packet 55 confused). Again, when mfreem is called to free the mbuf 

indicated to it, or alternatively they can keep the packet and it will discover the mapped packet, free the mbuf header and 

return it later via a call to NdisReturnPackets. By imple- call NDIS to free the packet and return the underlying 

menting it in this fashion, NDIS allows more than one buffers. Note that even though we do not directly map the 

protocol driver to accept a given packet. For this reason, header buffer with the mbuf we do not lose it because of the 
when a packet is delivered to a protocol driver, the contents 60 link from the packet descriptor. Note also that we could 

of the packet descriptor, buffer descriptors and data must all alternatively have the INIC miniport driver only pass us the 

be treated as read-only. At the moment, we violate this rule. TCP data buffer when it receives a slow-path receive. This

We choose to violate this because much of the FreeBSD would work fine except that we have determined that even 

code modifies the packet headers as it examines them in the case of slow-path connections we are going to attempt 
(mostly for endian conversion purposes). Rather than 65 to offer some assistance to the host TCP driver (most likely 

modify all of the FreeBSD code, we will instead ensure that by checksum processing only). In this case there may be 

no other transport driver accepts the data by making sure that some special fields that we need to pass up to the ATCP 
driver from the INIC driver. Leaving the header buffer
connected seems the most logical way to do this. 

FIG. 9C shows a received ARP frame. Recall that for
incoming ARP and ICMP frames we can copy the incoming
data out of the packet and return it directly to NDIS. In this 
case the mbuf 275 simply points to our data 278, with no 
corresponding packet descriptor. When we free this mbuf, 
mfreem will discover this and free not only the mbuf header, 
but the data as well. 
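
The three mapping cases of FIGS. 9A-9C reduce to one decision at free time. The following is a minimal sketch of that decision, not the driver's actual mfreem code. The structure layout, the owns_data flag, and the names atk_mbuf, atk_mfree and return_packet_to_ndis are assumptions made for illustration; return_packet_to_ndis merely stands in for the call to NdisReturnPackets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an NDIS packet descriptor. */
struct packet_desc {
    int returned_to_ndis;
};

/* Hypothetical mbuf header; field names are illustrative only. */
struct atk_mbuf {
    struct packet_desc *pkt; /* non-NULL when the data belongs to an NDIS packet */
    void *data;              /* start of the mapped or privately copied data */
    int owns_data;           /* nonzero when data was copied into our own buffer */
};

/* Stub standing in for NdisReturnPackets: records that the packet went back. */
static void return_packet_to_ndis(struct packet_desc *pkt)
{
    pkt->returned_to_ndis = 1;
}

/*
 * Free one mbuf header. If it maps an NDIS-owned packet (fast-path or
 * slow-path TCP), only the header is freed and the packet is returned.
 * If it maps our own copy (ARP or ICMP), the data is freed as well.
 * Returns 1 when a packet was returned to NDIS, 0 otherwise.
 */
static int atk_mfree(struct atk_mbuf *m)
{
    int returned = 0;

    assert(m != NULL);
    if (m->pkt != NULL) {
        return_packet_to_ndis(m->pkt);
        returned = 1;
    } else if (m->owns_data) {
        free(m->data);
    }
    free(m);
    return returned;
}
```

Keeping the ownership test inside the free routine means the FreeBSD-derived code can release mbufs without knowing where the data came from.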

This receive mechanism may also be used for other 
purposes besides the reception of network data. For instance, 
the receive mechanism may be used for communication 
between the ATCP driver and the INIC. One such example 
is a TCP context flush from the INIC. When the INIC 
determines, for whatever reason, that it can no longer 
manage a TCP connection, it must flush that connection to 
the ATCP driver. It does this by filling in a header buffer with 
appropriate status and delivering it to the INIC driver. The 
INIC driver in turn delivers it to the protocol driver which 
will treat it essentially like a fast-path TCP connection by 
mapping the header buffer with an mbuf header and deliv- 
ering it to TCP for fast-path processing. There are two 
advantages to communicating in this manner. First, it is 
already an established path, so no extra coding or testing is 
required. Second, since a context flush comes in, in the same 
manner as received frames, it will prevent us from getting a 
slow-path frame before the context has been flushed. 

Having covered the various types of receive data, at least
for the TCP example, the following are the steps that must be
taken by the ATKProtocolReceivePacket routine. Incoming
data is mapped to an ethernet frame and the type field is
checked. If the type field contains our custom INIC type
(TCP for example), and if the header buffer specifies a
fast-path connection, allocate one or more mbuf headers to
map the header and possibly data buffers. Set the packet 
descriptor field of the mbuf to point to the packet descriptor, 
set the mbuf flags appropriately, queue the mbuf, and return 
1. If the header buffer specifies a slow-path connection, 
allocate a single mbuf header to map the network data, set 
the mbuf fields to map the packet, queue the mbuf and return 
1. Note that we design the INIC such that we will never get 
a TCP segment split across more than one buffer. 

If the type field of the frame instead indicates ARP or
ICMP, an mbuf with a data buffer is allocated, the contents of
the packet are copied into the mbuf, the mbuf is queued, and
the routine returns 0 (not accepted). If the type field is not
the INIC, ARP or ICMP type, ATCP does not process the packet,
and the routine returns 0.
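
The classification steps above can be sketched as follows. This is an illustration under stated assumptions rather than the driver's code: the custom type value ATK_ETHERTYPE_INIC, the fast-path flag ATK_HDR_FASTPATH and its position in the first byte after the ethernet header are hypothetical, and ICMP is detected here by looking at the IP protocol byte, which the text does not spell out. The ethernet type offset (byte 12) and the ARP and IP type values are standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_TYPE_OFFSET     12      /* type field position, as stated in the text */
#define ETH_HEADER_LEN      14
#define ETHERTYPE_IP        0x0800
#define ETHERTYPE_ARP       0x0806
#define IP_PROTO_OFFSET     9       /* protocol byte within the IP header */
#define IP_PROTO_ICMP       1
#define ATK_ETHERTYPE_INIC  0x8FFF  /* hypothetical custom type, greater than 1500 */
#define ATK_HDR_FASTPATH    0x01    /* hypothetical flag in the header buffer */

enum atk_rx_class {
    ATK_RX_FASTPATH, /* keep the packet, map the header buffer */
    ATK_RX_SLOWPATH, /* keep the packet, map the network data */
    ATK_RX_COPY,     /* ARP or ICMP: copy the data out */
    ATK_RX_IGNORE    /* not ours */
};

static enum atk_rx_class atk_classify(const uint8_t *f, size_t len)
{
    uint16_t type;

    if (len < ETH_HEADER_LEN)
        return ATK_RX_IGNORE;
    type = (uint16_t)((f[ETH_TYPE_OFFSET] << 8) | f[ETH_TYPE_OFFSET + 1]);

    if (type == ATK_ETHERTYPE_INIC) {
        if (len > ETH_HEADER_LEN && (f[ETH_HEADER_LEN] & ATK_HDR_FASTPATH))
            return ATK_RX_FASTPATH;
        return ATK_RX_SLOWPATH;
    }
    if (type == ETHERTYPE_ARP)
        return ATK_RX_COPY;
    if (type == ETHERTYPE_IP &&
        len > ETH_HEADER_LEN + IP_PROTO_OFFSET &&
        f[ETH_HEADER_LEN + IP_PROTO_OFFSET] == IP_PROTO_ICMP)
        return ATK_RX_COPY;
    return ATK_RX_IGNORE;
}

/* Value handed back to NDIS: 1 when the packet is kept, 0 otherwise. */
static int atk_rx_return_value(enum atk_rx_class c)
{
    return (c == ATK_RX_FASTPATH || c == ATK_RX_SLOWPATH) ? 1 : 0;
}
```

Note that the copy case still returns 0, so NDIS may offer the same ARP or ICMP frame to the Microsoft TCP driver.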

The receive processing will continue when the mbufs are 
dequeued. At the moment we will assume this is done by a 
routine called ATKProtocolReceiveDPC. It will dequeue a 
mbuf from the queue, and inspect the mbuf flags. If the mbuf 
is meant for fast-path TCP, it will call the fast-path routine 
directly. Otherwise it will call the ethernet input routine for 
slow-path processing. 
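
A sketch of the dequeue-and-dispatch loop just described is given below. The queue layout, the ATK_M_FASTPATH flag and the two handler stubs are assumptions for illustration only; a real DPC would also hold a spin lock around the queue operations, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

#define ATK_M_FASTPATH 0x1 /* hypothetical mbuf flag marking fast-path data */

struct rx_mbuf {
    struct rx_mbuf *next;
    unsigned flags;
};

struct rx_queue {
    struct rx_mbuf *head;
    struct rx_mbuf *tail;
};

/* Counters let the stubs below show which path each mbuf took. */
static int fast_count;
static int slow_count;

/* Stub for the fast-path TCP routine. */
static void tcp_fastpath_input(struct rx_mbuf *m) { (void)m; fast_count++; }

/* Stub for the FreeBSD ethernet input routine (slow path). */
static void ether_input_slowpath(struct rx_mbuf *m) { (void)m; slow_count++; }

/* Called from the receive routine: append to the pending queue. */
static void rx_enqueue(struct rx_queue *q, struct rx_mbuf *m)
{
    m->next = NULL;
    if (q->tail != NULL)
        q->tail->next = m;
    else
        q->head = m;
    q->tail = m;
}

static struct rx_mbuf *rx_dequeue(struct rx_queue *q)
{
    struct rx_mbuf *m = q->head;

    if (m != NULL) {
        q->head = m->next;
        if (q->head == NULL)
            q->tail = NULL;
        m->next = NULL;
    }
    return m;
}

/* Drain the queue in arrival order; returns the number of mbufs handled. */
static int atk_receive_dpc(struct rx_queue *q)
{
    struct rx_mbuf *m;
    int handled = 0;

    while ((m = rx_dequeue(q)) != NULL) {
        if (m->flags & ATK_M_FASTPATH)
            tcp_fastpath_input(m);
        else
            ether_input_slowpath(m);
        handled++;
    }
    return handled;
}
```

Processing in arrival order matters because a context flush delivered through this same queue must be seen before any later slow-path frame for that connection.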

The ATCP transmit path is discussed in the following 
paragraphs, beginning with the NDIS 4 send operation. 
When a transport/protocol driver wishes to send one or more 
packets down to an NDIS 4 miniport driver, it calls Ndis- 
SendPackets with an array of packet descriptors to send. As 
soon as this routine is called, the transport/protocol driver 
relinquishes ownership of the packets until they are 
returned, one by one in any order, via a NDIS call to the 
ProtocolSendComplete routine. Since this routine is called 
asynchronously, our ATCP driver must save any required 
context into the packet descriptor header so that the appro- 
priate resources can be freed. This is discussed further 
below. 
Like the Receive path described above, the Transmit path
is used not only to send network data, but is also used as a 
communication mechanism between the host and the INIC. 
Some examples of the types of sends performed by the 
ATCP driver follow. 

FIG. 10 illustrates a fast-path send. When the ATCP driver 
receives a transmit request with an associated MDL 300 
from a client such as a host application, it packages up the 
MDL physical addresses into a command buffer 303, maps 
the command buffer with a buffer descriptor 305 and a 
packet descriptor 308, and calls NdisSendPackets with the 
corresponding packet. The underlying INIC driver will issue 
the command buffer to the INIC. When the corresponding 
response buffer is given back to the host, the INIC miniport 
calls NdisMSendComplete which will result in a call to the 
ATCP ProtocolSendComplete (ATKSendComplete) routine, 
at which point the resources (data 313) associated with the 
send can be freed. We allocate and use a mbuf 310 to hold 
the command buffer. By doing this we can store the context 
necessary in order to clean up after the send completes. This 
context includes a pointer to the MDL as well as other 
connection context. The other advantage to using a mbuf to 
hold the command buffer is that it eliminates having another 
special set of code to allocate and return command buffers.
We store a pointer to the mbuf in the reserved section of the 
packet descriptor so we can locate it when the send is 
complete. 
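
The idea of parking a context pointer in the packet descriptor and recovering it at completion time can be sketched as follows. All names here are hypothetical: send_packet models only the reserved area of a packet descriptor, send_ctx models the mbuf that holds the command buffer, and atk_prepare_send and atk_send_complete stand in for the work done around NdisSendPackets and ProtocolSendComplete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical context kept with the command buffer. */
struct send_ctx {
    void *mdl;   /* client MDL to release when the send completes */
    int conn_id; /* stand-in for other connection context */
};

/* Hypothetical packet descriptor showing only its reserved area. */
struct send_packet {
    void *reserved[2];
};

/* Before the send: allocate the context and store its pointer in the packet. */
static struct send_ctx *atk_prepare_send(struct send_packet *pkt, void *mdl, int conn_id)
{
    struct send_ctx *ctx = malloc(sizeof *ctx);

    if (ctx == NULL)
        return NULL;
    ctx->mdl = mdl;
    ctx->conn_id = conn_id;
    pkt->reserved[0] = ctx;
    return ctx;
}

/*
 * At completion: recover the context from the packet, hand the MDL back
 * to the caller and free the context. Returns the connection id.
 */
static int atk_send_complete(struct send_packet *pkt, void **mdl_out)
{
    struct send_ctx *ctx = pkt->reserved[0];
    int conn_id;

    assert(ctx != NULL);
    conn_id = ctx->conn_id;
    *mdl_out = ctx->mdl;
    pkt->reserved[0] = NULL;
    free(ctx);
    return conn_id;
}
```

Because completions may arrive asynchronously and in any order, everything needed for cleanup has to be reachable from the packet itself, which is what the reserved pointer provides.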

As described above, the receive process typically occurs 
in two phases. First the INIC fills in a host receive buffer 
with a relatively small amount of data, but notifies the host 
of a large amount of pending data (either through a large
amount of buffered data on the card, or through a large 
amount of expected NetBios data). This small amount of 
data is delivered to the client through the TDI interface. The 
client then responds with a MDL in which the data should 
be placed. Like the Fast-path TCP send process, the receive 
portion of the ATCP driver will then fill in a command buffer 
with the MDL information from the client, map the buffer 
with packet and buffer descriptors and send it to the INIC via 
a call to NdisSendPackets. Again, when the response buffer 
is returned to the INIC miniport, the ATKSendComplete 
routine will be called and the receive will complete. The
relationship between the MDL, command buffer and buffer
and packet descriptors is the same as shown in the Fast-
path send section above.

FIG. 11 illustrates a slow-path send. Slow-path sends pass 
through the FreeBSD stack until the ethernet header is 
prepended in ether_output and the packet is ready to be 
sent. At this point a command buffer will be filled with 
pointers to the ethernet frame, the command buffer will be 
mapped with a packet descriptor 315 and a buffer descriptor 
318 and NdisSendPackets will be called to hand the packet 
off to the miniport. FIG. 11 shows the relationship between 
the mbufs, command buffer, and buffer and packet descrip- 
tors. Since we will use a mbuf 320 to map the command 
buffer 322, we can simply link the data mbufs 325 directly 
off of the command buffer mbuf. This will make the freeing 
of resources much simpler. 
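
Why linking the data mbufs off the command buffer mbuf simplifies cleanup can be seen in a small sketch. The chain_mbuf structure and the helper names are assumptions for illustration; real mbufs carry far more state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, stripped-down mbuf used only to show the chaining. */
struct chain_mbuf {
    struct chain_mbuf *next;
    int is_command; /* nonzero for the mbuf holding the command buffer */
};

static struct chain_mbuf *chain_alloc(int is_command)
{
    struct chain_mbuf *m = calloc(1, sizeof *m);

    if (m != NULL)
        m->is_command = is_command;
    return m;
}

/* Link the data mbufs directly off the command-buffer mbuf. */
static void chain_link(struct chain_mbuf *cmd, struct chain_mbuf *data)
{
    assert(cmd != NULL && cmd->is_command);
    cmd->next = data;
}

/*
 * Free the whole send starting from the command-buffer mbuf.
 * Returns the number of mbufs released.
 */
static int chain_free(struct chain_mbuf *m)
{
    int freed = 0;

    while (m != NULL) {
        struct chain_mbuf *next = m->next;
        free(m);
        m = next;
        freed++;
    }
    return freed;
}
```

With this layout the completion routine needs only the one pointer stored in the packet descriptor; a single walk releases the command buffer mbuf and every data mbuf behind it.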

As shown in FIG. 12, the transmit path may also be used 
to send non-data commands to the INIC. For example, the 
ATCP driver gives a context to the INIC by filling in a 
command buffer 330, mapping it with a packet 333 and 
buffer descriptor, and calling NdisSendPackets. 

Given the above different types of sends, the ATKProto- 
colSendComplete routine will perform various types of 
actions when it is called from NDIS. First it examines the 
reserved area of the packet descriptor to determine what type 
of request has completed. In the case of a slow-path discuss the "top end", the TDI interface to higher-level NT

completion, it can simply free the mbufs, command buffer, network client software. We make use of an NT facility 

!°i n r P '°, ,S ^ rD T^ ^ Ca f ° f 3 faS '; P \ lh calIed a mter ^ (*« nG T)- NT allows a special type 

I S ' rrf" 5 I C? faSt "f f " ° f ,± C 0f ^ (" flller driv ^') <° »««* itself "on top" of anomVr 

ir£: W h^i c ^^ 

wiU again he notified mat the command sent to ufe INIC has ^^^^t^Z 

The only query operation currently done by the ATCP system. The filter driver may either handle these

driver is a query to obtain stats. This is done by the function requests itself, or pass them down to the underlying driver
ATKUpdateInicStats( ) in file atkinfo.c. Currently, the ATCP 10 that the filter driver is attached to. Provided the filter driver
driver recognizes four status indications from the lower completely replicates the (externally visible) behavior of the
INIC driver. These are handled by the function ATKStatus( underlying driver when it handles requests itself, the existence
), in proto.c. An NDIS_STATUS_MEDIA_DISCONNECT of the filter driver is invisible to higher-level software.
status indication is received if the INIC driver detects that Our filter driver attaches on top of the Microsoft TCP/IP
the link status on an interface is down. We simply mark our 15 driver. This gives us the basic mechanism whereby we can 
corresponding IFACE structure as "down". An NDIS_ intercept requests for TCP operations and handle them in our 
STATUS_MEDIA_CONNECT status indication is driver instead of the Microsoft driver. The functions which 
received when a previously down link status changes back actually receive the request IRPs from higher levels are the 
to "up". We mark the corresponding IFACE as "up", and various dispatch routines in the file atcpinit.c (this also
also do a gratuitous ARP to advertise it. An NDIS_ 20 contains the initialization code). 

STATUS_RESET_START status indication is received However, while the filter driver concept gives us a frame- 
when the INIC driver has decided to issue a reset to the work for what we wanted to achieve, there were some 
INIC. This reset will destroy any TCP or other contexts on significant technical problems that were solved. The basic 
the affected interface (we have no way to recover a context issue is that setting up a TCP connection involves a sequence 
from a dead INIC), so we call the function ATKResetFlush( 25 of several requests from higher-level software, and it is not
), in file atkfastpath.c, to abort any connections on the always possible to tell, for requests early in this sequence 
indicated interface. The interface is also marked down. An whether the connection should be handled by our driver or 
NDIS_STATUS_RESET_END status indication is the Microsoft driver. 

received when the INIC driver has reloaded and restarted an In a nutshell, this means that for many requests we store 
INIC after a reset. We mark the corresponding IFACE as 30 information about the request in case we need it later, but 
"up". also allow the request to be passed down to the Microsoft

We may not initiate INIC resets from the ATCP driver. TCP/IP driver in case the connection ultimately turns out to
Instead, as noted above, we may respond to reset status be one which that driver should handle.
indications from the INIC driver below the ATCP driver. Let us look at this in more detail, which will involve some 

Similarly, we do not initiate any HALT operations from the 35 examination of the TDI interface. The TDI interface is the 
ATCP driver. NT interface into the top end of NT network protocol 

In a first embodiment, the INIC handles only simple-case drivers. Higher-level TDI client software which requires 
in-sequence data transfer operations on a TCP connection. services from a protocol driver proceeds by creating various
These of course constitute the large majority of CPU cycles types of NT FILE_OBJECTs, and then making various 
consumed by TCP processing in a conventional driver. There 40 DEVICE_IO_CONTROL requests on these FILE 
are other complexities of the TCP protocol that are still in OBJECTS. 

this embodiment handled by host driver software: connection There are two types of FILE_OBJECT of interest here.
setup and breakdown; out-of-order data, nonstandard Local IP address/port combinations are represented by
flags etc. The NT OS contains a fully functional TCP/IP ADDRESS objects, and TCP connections are represented by
driver, and a better solution is to enhance this so that it is 45 CONNECTION objects. The steps involved in setting up a
able to detect our INIC and take advantage of it by "handing TCP connection (from the "active", client, side) are: 1)
off" data-path processing where appropriate. Unfortunately, Create an ADDRESS object, 2) Create a CONNECTION 
we do not have access or permission to modify NT source. object, 3) Issue a TDI_ASSOCIATE_ADDRESS 
Thus the optimal solution above, while straightforward, is IO-control to associate the CONNECTION object with the 
not implemented immediately. We thus provide our own 50 ADDRESS object, 4) Issue a TDI_CONNECT IO-control 
custom driver software on the host for those parts of TCP on the CONNECTION object, specifying the remote address 
processing which are not handled by the INIC. and port for the connection. 

This presented a challenge: The NT network driver frame- Initial thoughts were that handling this would be straight- 
work does make provision for multiple types of protocol forward: we would tell, on the basis of the address given 
driver; but it does not easily allow for multiple instances of 55 when creating the ADDRESS object, whether the connec- 
dnvers handling the same protocol. For example, there are tion is for one of our interfaces or not. After which it would 
no "hooks" into the Microsoft TCP/IP driver which would be easy to arrange for handling entirely by our'code or 
allow for routing of IP packets between our driver (handling entirely by the Microsoft code: we would simply examine 
our INlCs) and the Microsoft driver (handling other NICs). the ADDRESS object to see if it was "one of ours" or not 
Our solution to this was to retain the Microsoft driver for all 60 There were two main difficulties, however. 
non-TCP network processing (even for traffic on our INICs), First, when the CONNECTION object is created, no 
but to invisibly "steal" TCP traffic on our connections and address is specified: it acquires a local address only later 
handle it via our own (BSD-derived) driver. The Microsoft when the TDI_ASSOCIATE_ADDRESS is done Also 
TCP/IP driver is unaware of TCP connections on interfaces when a CONNECTION object is created, the caller supplies 
we handle. 6S ^ op^e "context cookie" which will be needed for later 

The network "bottom end" of this plural path processing communications with that caller. Storage of this cookie is the 
was described earlier in the document. In this section we win responsibility of the protocol driver it is not directly deriv- 



08/15/2003, EAST Version: 1.04.0000 



US 6.434,620 Bl 

37 38 

able just by examination of the CONNECTION object itself. field that NT FILE_OBJECTS provide for driver use ) We 

If we simply passed the "create" call down to the Microsoft then proceed with connection setup and handling in our 

TCP/IP driver, we would have no way of obtaining this driver, using information stored in our "shadow" objects 

cookie later, if it turns out that we need to handle the The Microsoft driver does not see the connection request 

connection. s or subsequent traffic on the connection. 

Therefore, for every CONNECTION object which is If the connection request is NOT for one of our interfaces 

created, we must allocate a structure to keep track of we pass it down to the Microsoft driver. Note, however that 

information about it, and store this structure in a hash table we can not simply discard our "shadow" objects at this 

keyed by the address of the CONNECTION object itself, so point. The TDI interface allows re-use of CONNECTION 
that we can locate it if we later need to process requests on 10 objects: on termination of a connection, it is legal for the 

this object. We refer to this as a "shadow" object: it TDI client to dissociate the CONNECTION object from its 

replicates information about the object stored in the current .ADDRESS object, re-associate it with another and 

Microsoft driver. We also pass the create request down to the use it for another connection. Thus our "shadow" objects" 

Microsoft driver too, to allow it to set up its own adminis- must be retained for the lifetime of the NT FILE_ 
trative information about the object. , 5 OBJECTS: a subsequent connection could turn out to be via 

A second major difficulty arises with ADDRESS objects. one of our interfaces. 
These are often created with the TCP/IP "wildcard" address For an incoming or "passive" connection setup, NT 
(all zeros); the actual local address is assigned only later allows at least two ways of doing things. There are explicit 
during connection setup (by the protocol driver itself.) A TDI_LISTEN and TDLACCEPT calls defined in the TDI 
"wildcard" address does not allow us to determine whether 20 spec. There is also a callback mechanism whereby a TDI 
connections that will be associated with this ADDRESS client can arrange to be called when an incoming connection 
object should be handled by our driver or the Microsoft one . request for a given port/address is seen. 
Also, as with CONNECTION objects, there is "opaque" In point of fact, no existing TDI clients appear to use the 
data associated with ADDRESS objects that cannot be explicit TDI_LISTEN and TDI_ACCEPT calls and we are 
denved just from examination of the object itself. (In this 25 not handling them in the ATCP driver. All incoming con- 
case addresses of callback -functions set on the object by nections are made via the callback mechanism. 
TDL_SET_EVENT IO-controls.) Initial steps are similar to active connection handling. The 

Thus, as in the CONNECTION object case, we create a TDI client creates an ADDRESS object, then creates one or 

"shadow" object for each ADDRESS object that is created more CONNECTION objects and associates them with it It 
with a wildcard address. In this we store information 30 also makes TDI_SET_EVENT calls on the address object 

(principally addresses of callback functions) which we will to set up handlers for data input, disconnection errors etc 

need if we are handling connections on CONNECTION and in this case, it also registers one more handler for 

objects associated with this ADDRESS object. We store connection requests. All of these creations and associations 

similar information, for any ADDRESS object that is explic- are "shadowed" in the ATCP driver, as in the active con- ' 
ltly for one of our interface addresses, as it is convenient to 35 nection case. 

use the same structure for both cases. With this concept of Next, recall that the INIC driver knows about the IP 
"shadow" objects in place, let us revisit the steps involved addresses of our interfaces, and filters incoming IP packets 
in setting up a connection, and look at the processing based on this information. So any connection requests which 
performed in the ATCP driver. we see in the ATCP driver are known to be for our interfaces 
For an outgoing or "active" connection setup, the TDI to Now we process analogously to the Microsoft TCP driver- 
client first makes a call to create the ADDRESS object. for an incoming connection request (TCP SYN), we look for 
Assuming that this is a "wildcard" address, we create a a "best match" address object. All our shadow ADDRESS 
"shadow" object before passing the call down to the objects are kept in a table hashed by port for this purpose. 
Microsoft driver. An address object matches if its port number matches the 
rhe next step (omitted in the earlier list for brevity) is 45 destination port in the packet; a match of both port and IP 
normally that the client makes a number of TDI_SET_ address takes precedence over a match of port only 
EVENT IO-control calls to associate various callback func- Assuming a suitable ADDRESS object is found, we call 
tions with the ADDRESS object. These are functions that the connection handler function which the TDI client reg- 
should be called to notify the TDI client when certain events istered in that object with information about the connection 
(such as arrival of data, disconnection requests, etc.) occur. 50 request (most importantly, the remote address and port) If 
We store these callback finction pointers in our "shadow" the TDI client which created that ADDRESS object is 
address object, before passing the call down to the Microsoft prepared to handle this connection request, it responds with 
dn ^ Cr ' • _ a TDI_CONNECT IRP, plus a "connection cookie" which 

Next, the TDI client makes a call to create a CONNEC- should correspond to the "context cookie" of one of the 

HON object Again, we create our "shadow" of this object. 55 CONNECTION objects associated with this ADDRESS 

Next, the client issues the TDI_ASSOCl ATE_ object. We locate this CONNECTION object, mark it as 

ADDRESS IO-control to bind the CONNECTION object to "one of ours", and proceed with BSD code TCP protocol 

the ADDRESS object. We note the association in our processing to establish the connection, 

"shadow" objects, and also pass the call down to the As in the active connection case, all activity on this 

Microsoft driver. 60 connection is handled by the ATCP driver; the Microsoft 

Finally the TDI client issues a TDI_CONNECT TCP driver knows nothing about it. Conversely, incoming 

IO-control on the CONNECTION object, specifying the connection requests for interface addresses other than INIC 

remote IP address (and port) for the desired connection. At addresses are filtered out at the INIC level; the ATCP driver 

this point, we examine our routing tables to determine if this never sees such connections or any traffic on them, 

connection should be handled by one of our interfaces, or by 65 In some cases when an ADDRESS object is created, an 

some other NIC. If it is ours, we mark the CONNECTION explicit port number is specified by the TDI client. This is 

object as "one of ours" for future reference (using an opaque typically the case for services (such as NETBIOS or FTP) 
which are preparing to respond to requests on well-known, assigned
ports.

In other cases, however, a port number of zero is given. In that
case, the TCP protocol driver is required to assign a port number.
Once again we run into the issue that, at the time an ADDRESS object
is created, we don't know if that address object is going to be used
for connections on our interfaces or others. In particular, there
are problems in the case of an ADDRESS object created with both port
and address as wildcards. If we assigned an arbitrary ephemeral port
for the ATCP "shadow" object, we would run into fatal problems with
WINSOCK applications such as WINS which create an ADDRESS object
with no specified port, and then query the address object to find
what port was assigned by the protocol driver. We would not know, in
the case of a wildcard ADDRESS object, which port number to return
for the query: ours, or the one assigned by the Microsoft driver.

Thus, we have to ensure that there is a single, consistent,
port-number space. To do so, we must always allow the Microsoft
driver to create its ADDRESS object (and hence assign its port), and
then catch the completion of the Microsoft create operation. At that
point, we issue a TDI_QUERY_INFORMATION request on the object to
obtain the port number that was assigned by the Microsoft driver,
and plug it into our "shadow" ADDRESS object.

A consequence of this is that, even in the case of an ADDRESS object
which is explicitly for one of our interfaces, we still allow the
Microsoft driver to create a corresponding ADDRESS object if no port
number was specified, in order to ensure a single consistent
port-number space.

The structures used for ATCP ADDRESS and CONNECTION objects are
defined in the file obmgr.h.

Most of the code for dealing with our shadow objects is in the file
obmgr.c; this contains functions which handle object creation,
cleanup and close, as well as the TDI_ASSOCIATE_ADDRESS,
TDI_DISSOCIATE_ADDRESS and TDI_SET_EVENT_HANDLER IO-controls.

Note that we catch the completion of most of the requests of
interest, and in fact much of our processing is done in our
completion handlers. In general, we want to proceed with our
processing only after we know that the Microsoft driver has
successfully completed.

Adapting the BSD TCP code to NT has been accomplished first by
fairly straightforward resolution of clashes between BSD and
Microsoft definitions, winnowing down the BSD system header files to
a minimum, and converting requests for various common OS resources
(memory allocation, copying, etc.) from BSD to NT services.

Areas where substantial redesign was needed to change from the
socket paradigm to the TDI interface are discussed in more detail
below.

For BSD Data Structures we have, as noted earlier, attempted to keep
the code as close to the BSD base as is possible. Thus for each
connection, we have a socket structure, an in_pcb structure, and a
tcpcb structure. These are defined in the usual BSD headers:
socketvar.h, in_pcb.h, and tcp_var.h respectively (though tcp_var.h
has moved to a common include directory, since it is also used by
INIC code).

Each connection also has an ATCP connection object (ATCONN, defined
in obmgr.h.) This means there are a total of four linked data
structures for each connection: this is unpleasantly unwieldy, and
would certainly not have been the approach had we been designing
from scratch.

The BSD structures have changed somewhat from their original forms.
The inpcb structure has changed to use the Microsoft LIST_ENTRY
definitions for queueing, and now contains link fields for a new
list of connections which are being handled by the ATCP driver as
opposed to the INIC. The tcpcb fields have been substantially
rearranged because a portion of the tcpcb structure is shared
directly with the INIC (DMA'd in and out when connections are
migrated), and some fields have been reduced from 4 to 2 bytes to
conserve memory on the INIC. And the socket structure has acquired
many new fields, mostly concerned with fastpath processing; it has
also lost a number of fields concerned with LISTENing socket queues
since the TDI passive connection mechanism is radically different.

Note that the socket structure exists primarily for compatibility
with function calls made by BSD TCP code. It has also become a
repository for a number of new ATCP per-connection fields, but their
location here is rather arbitrary, as they could equally well have
been placed in the inpcb or tcpcb structures. Its use differs
considerably from a BSD socket. In the ATCP driver, a socket
structure is allocated only when a connection is set up and has no
existence apart from TCP connections. Also unlike BSD, there is no
correspondence between this kernel-level socket structure and any
user-level socket. The "sockets" provided by the Microsoft WINSOCK
library are an entirely separate abstraction, which mimic the
user-level behavior of BSD sockets by creating and manipulating
ADDRESS and CONNECTION file objects in a library layer above the TDI
interface.

The mbuf structure has also changed quite considerably from BSD. It
is now defined in atkmbuf.h. There are no "small" (128 byte) mbufs
in the ATCP driver. ATCP mbufs are purely headers, whose m_data
fields point to actual data blocks (of various kinds) elsewhere. In
fact, ATCP mbufs fall into 4 categories: 1) MT_NDIS mbufs, which map
NDIS buffers from the low-level INIC driver, 2) MT_HEADER mbufs,
which point to 2K host buffers, similar to BSD's "cluster" mbufs,
3) MT_HOSTMDL mbufs, which map MDLs from a TDI_SEND, 4) MT_HCMD
mbufs, which map outgoing NDIS command buffers.

The m_hdr and pkthdr components of the mbuf struct have been
retained (though all mbufs now contain a pkthdr), but many new
fields have also been added, on a somewhat ad-hoc basis as they were
needed.

For Operation Completion, the BSD TCP code uses a traditional UNIX
approach. All processing occurs in the (kernel) context of the
process owning the socket for the connection. Each request (for
connection, data output, etc.) executes in the process context until
it reaches a point where it is necessary to wait for resources or
activity. At that point the process sleeps. When it is later woken
(by an interrupt, timer, etc.), processing resumes, still in the
process context.

As we have noted elsewhere, the NT paradigm is more asynchronous. A
request is initiated by receipt of an IRP, but once processing has
been started and the IRP is placed into a pending state, the
initiating thread is free to go about other business. At the point
where we want to complete the IRP, we no longer have any reference
to the originating thread, and indeed, that thread may not even be
explicitly waiting for the particular completion. The question
arises, therefore: in what context will IRP completions run in the
ATCP driver?

The solution we have chosen is a DPC. This is an NT kernel facility
that allows a call to a function to be scheduled (to run in an
arbitrary thread context) as soon as the processor on which the DPC
request was made becomes
free. When we create our CONNECTION objects, each one has a DPC
object initialized in it. Then, the BSD "wakeup" functions
(sorwakeup, sowwakeup, soisconnected, etc.) are reimplemented as
code which schedules a DPC on that connection (and also sets flag
bits to indicate which event has occurred).

The function which is run by the connection DPC is ATKConnDpc( );
the code for this is in atksocket.c. This DPC function is central to
the operation of the ATCP driver: most IRP completions, as well as
slow-path data indications and delivery, run in this DPC context.

In a BSD system, Active Connection Setup starts with creating a
socket. In NT, however, it starts with creating ADDRESS and
CONNECTION objects, as described in section 10.2.

The final step is a TDI_CONNECT IO-control on the connection object.
This results in a call to the function ATKConnect( ), in the file
atktdi.c. After some initial checks, this calls the function
ATKSoCreate (in file atksocket.c) which allocates socket, inpcb and
tcpcb structures for the connection and links them together in the
ways expected by the BSD code (and also links this assemblage to the
connection object.) At this point, we now have data structures in a
form which is usable by BSD TCP code. We simply call tcp_usrreq( )
with PRU_CONNECT to cause the connection to be initiated and pend
the TDI_CONNECT IRP, saving a pointer to it in the connection
object. The BSD code then proceeds essentially as it would in a BSD
system, and once the connection is established, it calls
soisconnected( ). In the ATCP world, this translates to a function
which schedules a DPC on the connection; when the ATKConnDpc( )
function runs, it completes the TDI_CONNECT IRP.

For Passive Connection Setup, there are no "listening" sockets in
the NT world. Large sections of tcp_input( ) concerned with
listening sockets are #ifdef'd out.

Instead, when we detect an incoming SYN segment in tcp_input( ), we
call ATKPassiveConnect( ), which is located in the file atktdi.c.
This function implements the callup described above to locate a
connection object to use for the connection. Assuming one is found,
we call ATKSoCreate( ), which as in the active connection case
allocates socket, inpcb and tcpcb structures for the connection and
links them as required by the BSD code. We then save the returned
TDI_CONNECT IRP in our connection object and return to tcp_input( ).
Processing then continues as if the newly-created socket were one of
the "spawned" sockets from a BSD "listening" socket, and once the
connection is established, the BSD code calls soisconnected( ),
which schedules a DPC which completes the IRP.

Disconnection in the NT world is not signaled by a 'close' on a
descriptor; instead there is an explicit TDI_DISCONNECT IO-control.

The TDI_DISCONNECT call results in a call to ATKDisconnect (in file
atktdi.c.) Handling is somewhat similar to that of active connection
setup: we pend the IRP, save a pointer to it in our connection
object, and call tcp_usrreq( ) to initiate the disconnection
handshake.

Note that if the context is currently on the INIC (i.e. we are in
fastpath state), we must flush the context from the INIC before
proceeding with disconnection. In that case we note the
disconnection in socket flags and issue a flush but do not call the
BSD code here. The disconnection will be done when the flush
completes.

Once the BSD code has completed the disconnection handshake, it
calls soisdisconnected( ). In the ATCP driver this translates to a
scheduling of the connection DPC; the DPC function completes any
pended TDI_DISCONNECT IRP.

A further NT wrinkle here is that TDI clients register a
disconnection handler function for connections, and we normally need
to call this, as well as completing any TDI_DISCONNECT IRP, when
closing a connection. We also need to call the disconnection notify
function when the [illegible in source]; in that case the
tcp_input( ) code calls socantrcvmore( ), which again translates to
a scheduling of the connection DPC with appropriate flags. The
notification is done by ATKNotifyDisConnection (in file
atksocket.c); the determination of whether, and with what flags, we
should call the disconnection notify function, is made by a series
of tests at the start of this function.

The next few paragraphs describe slow-path output. Data output on a
connection is initiated by a TDI_SEND request on the CONNECTION
object; the request IRP points to an MDL describing the data to be
sent. The request results in a call to ATKSend (in file atktdi.c.)
This locks the connection, and after some initial checks calls
ATKMapMdlToMbuf, located in the file atkmbuf.c. The latter allocates
an ATCP mbuf to map the request; a pointer to the IRP is saved in a
field in the mbuf. Note that, unlike BSD, we do NOT copy data from
the request; we simply map the request MDL with an mbuf header.
Also, there is no notion in ATCP of a "watermark": any TDI_SEND
request is always accepted and queued. The TDI rule is that the
TDI_SEND IRP is not completed until the data from it is acknowledged
by the peer, so throttling of send requests is accomplished in NT by
deferring their completion, rather than blocking the sender as in
BSD.

The mbuf is then queued on to the socket send buffer, and we call
tcp_usrreq( ) with PRU_SEND to cause the BSD TCP code to actually
send the data (this in fact results in a call to tcp_output.) The
connection is then unlocked.

The usual checks are made in tcp_output to determine whether a frame
can be sent. If possible, we build a buffer containing MAC, IP and
TCP headers, followed by the data. Details of this differ somewhat
from regular BSD. The mbuf we use for output is an MT_HEADER mbuf,
which points to a 2K buffer in host memory. This is always enough to
contain all the protocol headers plus a maximum-sized amount of
payload, so we construct the entire packet in a single mbuf. We
allow space at the front of the buffer for the protocol headers, and
then call m_copymdldata( ), located in atkmbuf.c, to copy data from
the TDI_SEND MDLs mapped by mbufs queued on the socket send buffer
into the packet we are constructing. The mbuf containing the output
packet is then passed down to ip_output( ) as usual.

Later, when data has been ACK'd by the peer, there is a call from
tcp_input( ) to sbdrop( ) on the socket send buffer. The sbdrop code
(in atksocket.c) performs processing essentially similar to vanilla
BSD code (though it has been somewhat simplified): it adjusts m_data
and m_len fields in the mbuf chain it is operating on, and if any
mbufs are entirely consumed, it calls m_free( ) to release them.

The m_free( ) function (in file atkmbuf.c) has been modified to
recognize the new ATCP mbuf types: when it detects that the mbuf is
one which maps a TDI_SEND MDL, it schedules a DPC to cause the
associated TDI_SEND IRP (which we saved in the mbuf before queueing
it on the socket send buffer) to be completed.

The next few paragraphs describe slow-path input. Processing is
largely unchanged from BSD up to the point where the tcp_input( )
code queues the mbuf on to the socket receive buffer with
sbappend( ), and calls sorwakeup( ).

At that point things become very different. In NT there is no
process sleeping on the socket receive buffer to be woken
up and copy out the data. Instead, the sorwakeup( ) call translates
in ATCP terms into a scheduling of the connection DPC; [illegible in
source] ATKPresentData (in file atktdi.c) to deal with the data
which has been appended on [illegible in source].

The basic approach is that if we currently have an MDL we copy the
data into it and complete it if possible. If we don't have an MDL,
or have data left over after copying into the one we completed, we
will indicate data unless we're in a state where we have already
indicated and are expecting an MDL from a TDI_RECEIVE shortly. The
ATKPresentData( ) function is also cognizant of fastpath, and will
call ATKDataRequest (in file atkfastpath.c) to hand out MDLs to the
INIC when appropriate.

Data which has been consumed, either by copying into an MDL or by
being taken by an indication, is dropped from the socket receive
buffer with sbdrop( ). This calls m_free( ) once an mbuf has been
completely consumed. The m_free function has been enhanced to know
about the ATCP flavors of mbufs; it detects that this receive mbuf
is actually one mapping NDIS buffers, and returns them to NDIS.

The fast-path data pathways and connection handout and flush were
discussed conceptually earlier, so we will now simply identify the
actual code which implements these functions, adding a few notes as
we go. Most of the fastpath code is in file atkfastpath.c; all
functions mentioned in this section are in this source file unless
noted otherwise or earlier identified.

Fast-path input frames are identified as such in ATKReceiveDpc (file
proto.c), and handed to ATKFastInput( ). There, if it is a data
frame or header, we simply queue it on the socket receive buffer and
call ATKPresentData( ); as noted in the previous subsection, this
knows about fastpath and is prepared to hand out MDLs to the INIC if
appropriate. The completion of fastpath receive MDLs is handled by
ATKFastReceiveDone( ).

Fast-path output originates in ATKSend( ); there, if the connection
is in fastpath state we do not do the slow-path processing described
in section 11.6. Instead we map the MDL with a data request
structure (ATKDR, defined in socketvar.h), save the IRP in that
request, queue the request on the socket, and call
ATKFastSendStartIo (file atkfastpath.c.) If there are currently less
than the maximum allowed outstanding send requests active on the
INIC, this calls ATKDataRequest( ) to hand the send MDL out to the
INIC. Completion of fastpath send requests is handled in
ATKFastSendDone( ).

The decision to hand a connection out to the INIC is made in the
ubiquitous DPC function ATKConnDpc( ), either immediately when a
connection is set up, or later when the connection is determined to
be in a suitably quiescent state. This is a natural place to control
the handout, since the DPC function is centrally involved in almost
any data activity or state transition which affects the connection.
The initial handout message is sent by ATKStartMigration( ). When
the NULL interlock frame arrives it is handled by
ATKHasProvisionalContext( ); this does some checks to ensure that
migration should still proceed, and if so, calls
ATKCompleteMigration( ) to send the second-half handout command.
Completion of this is handled by ATKCompleteMigrationDone( ), which
does a few more checks and then sets the fastpath state of the
connection to "established". There is an ordered set of fastpath
states, rather analogous to the TCP finite-state machine states,
defined in socketvar.h: SO_FPNONE to SO_FPCLEANUP.

Origination of a flush from the host side is done by the function
ATKFlushContext( ) which simply sends a flush command identifying
the context to flush. The final cleanup is done in
ATKFastPathCleanup( ); this may be called from ATKFastInput( ) on
receiving a "flush" frame, or from either of ATKFastReceiveDone( )
or ATKFastSendDone( ) on completion. (Recall from above that
[illegible in source] completions of all outstanding data [illegible
in source].)

BSD code contains many IOCTL handlers for obtaining statistics.
However, these have no direct correspondence to the NT statistics
queries, and additionally, we must deal with the fact that there are
two parallel protocol stacks. In many cases, the overall statistics
being requested are made up of components from both stacks.
Therefore, we do not use any of the BSD IOCTL handler code.

Instead we arrange to catch completion of the various NT statistics
queries, which are IOCTL_TCP_QUERY_INFORMATION_EX requests on
CONTROL FILE_OBJECTs, so that we can merge information from our
driver with that returned from the Microsoft driver.

The functions for doing this are in atkinfo.c. Note that for certain
statistics, only the INIC has the exact values, since it consumes
protocol headers internally for fastpath connections. Therefore in
order to obtain the ATCP information to merge with the information
from the Microsoft driver, we need to query the INIC. An
Alacritech-specific OID_INIC_GET_STATS is defined for this purpose,
and used in ATKUpdateInicStats( ). We notice from tracing that NT is
astoundingly profligate and inefficient in its use of stats queries
(a netstat invocation, for example, may result in literally
thousands of repeated queries!), so we keep a timestamp of last
query and repeat the INIC query only after a reasonable time has
elapsed since the previous one.

In most places where a structure needs to be allocated on the ATCP
driver for memory allocation, we are just calling the basic NT
ExAllocatePool( ) function. We don't at this point have a good feel
for how efficient (or otherwise) the NT kernel memory allocation
code is: if profiling later shows that it is worthwhile, we could
adopt the approach of keeping our own (hopefully more efficient)
freelists of our commonly-used structures.

This might be particularly relevant if connection setup overhead
proves to be an issue, since three separate structures (socket,
inpcb and tcpcb) need to be allocated for each connection. Rather
than doing three separate allocations, we could keep a free pool of
groups of these, already linked appropriately.

We have taken the pragmatic approach of implementing only the
functionality that we have actually observed to be used, in tracing
and extensive testing. However, there are a number of other features
which may be derived from the TDI spec or inferred from the
Microsoft TCP code which have never been observed to be used, and we
have omitted them for simplicity. These include: out-of-band data,
TDI_LISTEN and TDI_ACCEPT calls, IOCTLs for setting interfaces up
and down, IOCTLs for setting security information (although registry
keys for security features are implemented on a separate pathway),
and a number of "hidden" registry parameter keys.

As with conventional networking cards, the Alacritech INIC employs
an associated device driver. This document describes the device
driver used with the Microsoft Windows NT and 9x operating systems.

Network device drivers used in Microsoft operating systems conform
to the Network Driver Interface Specification (NDIS) defined by
Microsoft. NDIS provides a set of standard entry points used for
initialization, query and set functions (IOCTLS), sending and
receiving data, and reset
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and shutdown. NDIS also provides a set of library functions 
used to interact with the operating system. These functions 
include routines used to allocate memory, access PCI IO 
space, allocate and manage buffer and packet descriptors, 
and many other functions. An NDIS driver must be imple- 
mented exclusively within the bounds of NDIS and not 
make any calls to NT or 9x operating system routines 
directly. 

An NDIS NIC driver is used as a means of communica- 
tion between upper level protocol drivers (TCP/IP, 
Appletalk, IPX) and a specific networking device. For 
example, when the TCP/IP protocol driver wishes to send an
ethernet packet, the packet is passed to the NIC driver via the 
standard NDIS transmit interface. The NIC driver is respon- 
sible for interacting with its associated hardware to ensure 
that the packet is transmitted on the network. 

As shown in FIG. 13, the INIC miniport driver 200 is 
implemented as a standard miniport driver and is connected to
the INIC 50 over the PCI bus 57. The INIC has four network 
connections 340 in this embodiment. 

As mentioned above, we can reduce the number of 
interactions between the INIC device driver and the INIC 50 
(nicknamed Simba) by passing multiple buffers to the INIC 
in a single write, and allocating a physically contiguous 
chunk of memory and dividing it into several buffers. We
also define four types of buffers: header buffers, which
contain information about received data as well as possibly
the data itself (if the data is small); data buffers, which are
always accompanied by a header buffer and contain large
chunks of received data; command buffers, which contain
information about data that is to be sent; and response
buffers, which contain information about command buffers
that have just been completed.

Header buffers are 256 bytes, data buffers are 2 k, 
command buffers are 512 bytes, and response buffers are 32 
bytes. Thus, in a contiguous 4 k page, we can allocate 16 
header buffers, 2 data buffers, 8 command buffers, or 128 
response buffers. Thus in a single write we can pass 16 
header buffers, or 2 data buffers, or 128 response buffers off 
to the INIC. We choose a 4 k buffer size because that is the 
page size for NT. Theoretically, NT should allow us to 
allocate larger blocks of contiguous memory, but the likelihood
of the allocation failing increases past the page size. We call 
this 4 k page a Simba Data Buffer (SDB). 
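The arithmetic above can be checked with a short sketch. The sizes come from the text; the macro, enum, and function names are illustrative assumptions and are not taken from the driver source.

```c
/* Sizes as stated in the text; all identifiers here are hypothetical. */
#define SDB_SIZE       4096u  /* one NT page */
#define HDR_BUF_SIZE    256u
#define DATA_BUF_SIZE  2048u
#define CMD_BUF_SIZE    512u
#define RSP_BUF_SIZE     32u

typedef enum { SDB_HEADER, SDB_DATA, SDB_COMMAND, SDB_RESPONSE } sdb_type;

/* Size in bytes of one buffer of the given type. */
static unsigned sdb_buffer_size(sdb_type type)
{
    switch (type) {
    case SDB_HEADER:   return HDR_BUF_SIZE;
    case SDB_DATA:     return DATA_BUF_SIZE;
    case SDB_COMMAND:  return CMD_BUF_SIZE;
    case SDB_RESPONSE: return RSP_BUF_SIZE;
    }
    return 0;
}

/* Number of buffers of the given type that fit in one SDB. */
static unsigned sdb_buffer_count(sdb_type type)
{
    unsigned size = sdb_buffer_size(type);
    return size ? SDB_SIZE / size : 0;
}

/* Byte offset of buffer `index` from the start of its SDB. */
static unsigned sdb_buffer_offset(sdb_type type, unsigned index)
{
    return index * sdb_buffer_size(type);
}
```

Dividing the page size by each buffer size reproduces the counts quoted above (16, 2, 8, and 128).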

Let's say that we pass 16 header buffers off to the INIC. 
These header buffers will be returned to us as data arrives,
and are likely to be passed upstream to the ATCP driver. 
They will then be returned to us, out of order, at a later time 
at which point we can mark them as free. Before we can 
re-use the SDB, however, all of the buffers (header, or data, 
etc) within that SDB must have been returned to us. Since 
ATCP can return them in any order we need a way of 
keeping track of which buffers within an SDB are free and 
which are still in use. We do this by maintaining a 16-bit 
bitmask. Bits are cleared when the corresponding buffers are
handed to the INIC (all 16 for header buffers, just 2 for data
buffers), and then set again when the buffer is freed. When
all 16 bits are set again (the bitmask is full), the SDB can
be re-used.
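A minimal sketch of this bookkeeping follows, assuming the mask starts full (0xFFFF) and that buffer i corresponds to bit i. The type and function names are hypothetical; a real driver would also need to protect the mask with a spinlock.

```c
#include <stdint.h>
#include <stdbool.h>

#define SDB_MASK_FULL 0xFFFFu

typedef struct {
    uint16_t free_mask;  /* bit i set => buffer i is not outstanding */
    unsigned nbuffers;   /* 16 for a header SDB, 2 for a data SDB */
} sdb_tracker;

static void sdb_init(sdb_tracker *t, unsigned nbuffers)
{
    t->free_mask = SDB_MASK_FULL;
    t->nbuffers = nbuffers;
}

/* Hand every buffer in the SDB to the INIC: clear one bit per buffer. */
static void sdb_hand_to_inic(sdb_tracker *t)
{
    uint16_t used = (uint16_t)(((uint32_t)1 << t->nbuffers) - 1u);
    t->free_mask &= (uint16_t)~used;
}

/* A buffer has come back (possibly out of order): set its bit again. */
static void sdb_buffer_returned(sdb_tracker *t, unsigned index)
{
    t->free_mask |= (uint16_t)(1u << index);
}

/* The SDB may be re-used only when every bit is set again. */
static bool sdb_is_reusable(const sdb_tracker *t)
{
    return t->free_mask == SDB_MASK_FULL;
}
```

Because unused bits of a data SDB are never cleared, the same "mask equals 0xFFFF" test works for both header and data SDBs.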

Note that 16 bits is not enough to manage the 128 
response buffers. It is not necessary to keep track of the 
response buffers since they are never passed upstream. For 
response buffers, we just maintain a circular queue of two 
SDBs. When the INIC uses all of the buffers in one response 
SDB, we pass it back to the INIC and jump to the other. Note 
also that while the INIC driver also uses SDBs for command 
buffers, command buffers are only passed to the INIC one at 
a time. Furthermore, as discussed elsewhere, the ATCP 
driver allocates and manages its own separate set of
command buffers.

As we've noted, we must maintain a bitmask for each SDB. We
need to maintain other information about an SDB as well.
This information includes the virtual and physical address of
the SDB, linked list pointers, the type of SDB (header, data,
etc), the current offset within an SDB (next expected header/
response buffer), etc. We keep all of this information in a
structure that we call SDBHANDLE.

We have a unique challenge in the INIC driver. Unlike
other NIC drivers, the INIC driver may be receiving data
that has already been acknowledged by the INIC network
adapter. This means that once data has been received by the
driver, it cannot be dropped. This in turn requires that all
resources needed to receive a packet are allocated before a
header and/or data buffer is passed off to the INIC. Included
in these resources are NDIS buffer and packet descriptors.
NDIS buffer and packet descriptors are used to map a
network buffer when being passed via NDIS. The packet
descriptors represent a received frame, and can be made up
of multiple physical buffers, each represented by an NDIS
buffer descriptor. Note that, as pointed out above, every
received frame is given to us via a header buffer, and there
may or may not be an associated data buffer with it. This
means that for every header buffer given to the INIC we
must pre-allocate a packet descriptor and a buffer descriptor
(since the header buffer may be sent upstream), while for
every data buffer we must only pre-allocate a buffer descriptor.
Since these resources are pre-allocated, we need a place
to store them until the header and/or data buffers are returned
to us. Again, we maintain a structure called the SDBDESC
(SDB Descriptor) structure for every header and data buffer
given to the INIC. We include 16 of these structures in the
SDBHANDLE structure, one for each header buffer in an
SDB (14 are not used for data buffers).
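One possible layout for these two structures is sketched below. Only the list of fields comes from the description; every field name, type, and the use of `void *` in place of the real NDIS descriptor types is an assumption made for illustration.

```c
#include <stdint.h>

#define SDB_MAX_BUFFERS 16  /* header buffers per SDB; a data SDB uses only 2 */

typedef enum {
    SDB_TYPE_HEADER, SDB_TYPE_DATA, SDB_TYPE_COMMAND, SDB_TYPE_RESPONSE
} sdb_kind;

/* Resources pre-allocated for one header or data buffer (hypothetical). */
typedef struct sdb_desc {
    void *packet_desc;  /* stand-in for an NDIS packet descriptor;
                           left unused for data buffers */
    void *buffer_desc;  /* stand-in for an NDIS buffer descriptor */
} sdb_desc;

/* Per-SDB bookkeeping (hypothetical field names). */
typedef struct sdb_handle {
    void              *virt_addr;   /* virtual address of the 4 k SDB */
    uint64_t           phys_addr;   /* physical address of the SDB */
    struct sdb_handle *next;        /* linked list pointers */
    struct sdb_handle *prev;
    sdb_kind           kind;        /* header, data, etc */
    uint16_t           free_mask;   /* one bit per buffer */
    uint16_t           cur_offset;  /* next expected header/response buffer */
    sdb_desc           desc[SDB_MAX_BUFFERS];
} sdb_handle;

/* Number of descriptor slots an SDB of the given kind actually uses. */
static unsigned sdb_desc_in_use(sdb_kind kind)
{
    switch (kind) {
    case SDB_TYPE_HEADER: return 16;
    case SDB_TYPE_DATA:   return 2;
    default:              return 0;  /* no SDBDESC use is described for these */
    }
}
```

Embedding a fixed array of 16 descriptors keeps the handle a single allocation, at the cost of 14 idle slots in every data SDB.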

We maintain several queues of SDBs for each adapter in 
the system. These queues are named and described as 
follows: 

AllSDBs is a linked list of all SDBs allocated for the 
adapter. Used to locate and free SDBs when the driver is 
halted. 

FreeSDBs is a linked list of SDBs which are free for use
(bitmask is 0xFFFF).

HdrQueue is a linked list of header SDBs. Used by the
SimbaReceiveDpc routine described below to locate newly
received frames.

CmdQueue is a list of free command buffers which can be 
sent, when needed, to the INIC. 

RspQueue is a circular queue of SDBs that contain
response buffers used by the INIC.

Note that we do not maintain a queue of data buffer SDBs.
Data buffer SDBs are
allocated from the free queue and given directly to the INIC. 
They are returned to us attached to header buffers so we do 
not need to keep track of them ourselves. This is described 
further below. 

As shown in FIG. 14, in a given system, we maintain a 
single INIC driver 200. Associated with that INIC driver 
may be a number of INIC cards, each of which may contain, 
for example, four adapters. To keep track of this the driver 
maintains a structure that represents the driver called the 
SIMBA_DRIVER structure 350. The SIMBA_DRIVER 
structure is pointed to by the global variable SimbaDriver. 
Among the fields in the SIMBA_DRIVER structure is a
pointer to a linked list of SIMBA_CARD structures (355,
357), each one representing an INIC PCI card. The
SIMBA_CARD contains information about a particular
card. This includes the state of the card (UP, DOWN, FAIL),
the PCI slot number, the number of adapters on the card, the
number of adapters initialized on the card, the number of
adapters halted on the card, and other information. It also
contains a 4-entry array of ADAPT structure pointers (340,
342). For each adapter on the card (there may be less than
4), an entry in this array is filled in with a pointer to the
ADAPT structure which is used to represent that particular
adapter. The ADAPT structure is the primary structure in the
INIC driver and contains, among other things, the NDIS
handle associated with the interface, a back pointer to the
card structure, the index of the adapter on the card, a pointer
to the base PCI address of the INIC registers, resources
associated with the interface, etc. FIG. 14 shows an
implementation having four adapters (344-351) on each of two
INIC cards.

Every NT driver has a DriverEntry routine. For NDIS
drivers, the main purpose of the DriverEntry routine is to
register all of the expected driver entry points with NDIS.
These entry points include MiniportInitialize (called during
interface initialization), MiniportSend (called when a
protocol driver wishes to send a frame), MiniportISR (called
when an interrupt occurs), MiniportHalt (called when the
driver is halted), and others.

We define SimbaInitialize as the MiniportInitialize routine
for the INIC driver. The SimbaInitialize routine is called
once for every INIC adapter in the system. Recall that an
INIC adapter is an interface found on an INIC card. Thus
SimbaInitialize can be called up to four times per card. The
purpose of the SimbaInitialize function is to allocate and
initialize the ADAPT and optionally the SIMBA_CARD
structure, allocate resources needed by the interface, and
perform any hardware initialization required to make the
interface operational, as described in more detail below.

The oemsetup installation script, described below, stores
a number of parameters into the registry for each INIC
adapter installed in the system. These parameters include the
following query registry parameters:

CardBase — This parameter serves as a unique identifier
for the INIC card. This is set to the PCI slot number with the
PCI bus number OR'd into the top bits.

CardIndex — Index of the adapter on the card (0-3 for the
four port INIC).

CardSize — Number of adapters on the card.

BusNumber — Bus number on which the card resides.

SlotNumber — PCI slot number of the card.

FunctionNumber — PCI function number of the adapter
(0-3 for the four port INIC).

NetworkAddress — An optional, administrator defined,
network address.

As noted above, the SimbaInitialize routine will be called
four times per four-port INIC. For each time that it is called,
we must allocate an ADAPT structure. On the other hand,
we must only allocate a single SIMBA_CARD structure to
represent the entire card. This is why we read the registry
parameters before allocating the ADAPT and SIMBA_CARD
structures. Having read the registry parameters, we
search through the list of already-allocated SIMBA_CARD
structures looking for one that has the same CardBase value
as the adapter that we are initializing. If we find one, we
simply link our new ADAPT structure into the Adapt array
field of the SIMBA_CARD structure using the CardIndex
parameter. If we do not find an associated SIMBA_CARD
structure, then we allocate a new one, link in our ADAPT
structure, and add the new SIMBA_CARD structure to the
Cards field of the SIMBA_DRIVER structure.

Before the ATCP driver can talk to the INIC card it must
configure the PCI configuration space registers. This
involves calling the necessary NDIS functions to read the
device and function ID's (used to verify that the information
obtained from the registry is correct), read the memory base
register, read the IRQ, and write the command register.

Note that there is a PCI configuration space for every
adapter on an INIC card (four for a four-port card). Thus we
go through PCI configuration every time our initialization
routine is called. There is one catch to this. While there is a
configuration space header for every adapter, the bus master
bit in the command register is only enabled for multifunction
device 0. This can pose a problem. Assume that we have a
four-port INIC, but the administrator has removed device 0. 
When we initialize PCI configuration space for devices 1, 2, 
and 3, bus mastering will not be enabled and none of the 
adapters will work. We solve this by enabling bus mastering 
for device 0 every time any of the interfaces is initialized. 

The next step in the INIC initialization is to allocate all the 
resources necessary for a single interface. This includes 
mapping the memory space obtained from the PCI configu- 
ration space so that we can access the INIC registers, 
allocating map registers used to obtain physical buffer 
addresses, allocating non-cached shared memory for the ISR 
and other data, allocating pools of buffer and packet 
descriptors, allocating spinlocks, and registering the inter- 
rupt (IRQ) obtained from the PCI configuration space. 

Note that we do not allocate SDBs at this time. SDBs are 
allocated on an as-needed basis and consequently are not 
allocated until the card is initialized and we are prepared to 
pass buffers off to it. 

At this point in the initialization process the INIC hard- 
ware is initialized. When we begin interface initialization we 
check the state of the card (contained in the SIMBA_CARD 
structure). If the state is down (probably because we are the 
first interface on the card to be initialized), then we must 
perform INIC card initialization.
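The per-card bookkeeping performed during interface initialization (look up the SIMBA_CARD by its CardBase value, create it if absent, then decide whether card-level initialization is still needed) might be sketched as follows. The structure layouts, function names, and the use of `calloc` in place of the NDIS memory routines are assumptions for illustration only.

```c
#include <stdlib.h>
#include <stdint.h>

#define ADAPTERS_PER_CARD 4

typedef enum { CARD_DOWN, CARD_UP, CARD_FAIL } card_state;

struct simba_card;

typedef struct adapt {
    int                card_index;  /* index of the adapter on the card */
    struct simba_card *card;        /* back pointer to the card structure */
} adapt;

typedef struct simba_card {
    uint32_t           card_base;   /* unique identifier from the registry */
    card_state         state;
    adapt             *adapters[ADAPTERS_PER_CARD];
    struct simba_card *next;
} simba_card;

typedef struct { simba_card *cards; } simba_driver;

/* Find the card with this CardBase, or allocate and link a new one. */
static simba_card *find_or_create_card(simba_driver *drv, uint32_t card_base)
{
    simba_card *c;
    for (c = drv->cards; c != NULL; c = c->next)
        if (c->card_base == card_base)
            return c;
    c = calloc(1, sizeof *c);
    if (c == NULL)
        return NULL;
    c->card_base = card_base;
    c->state = CARD_DOWN;           /* a new card has not been initialized */
    c->next = drv->cards;
    drv->cards = c;
    return c;
}

/* Link an adapter into its card. Returns 1 if the caller must still perform
 * card-level initialization, 0 if the card is already up or failed, and -1
 * on error. */
static int attach_adapter(simba_driver *drv, adapt *a,
                          uint32_t card_base, int card_index)
{
    simba_card *c;
    if (card_index < 0 || card_index >= ADAPTERS_PER_CARD)
        return -1;
    c = find_or_create_card(drv, card_base);
    if (c == NULL)
        return -1;
    a->card = c;
    a->card_index = card_index;
    c->adapters[card_index] = a;
    return c->state == CARD_DOWN;
}
```

With this shape, the first interface on a card sees the state as down and performs the card initialization; later interfaces on the same card find the existing structure and skip it.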

The first step in initializing the card is to reset and 
download the card. The reset is performed by writing to the 
reset register. This is a reliable hardware register, as opposed 
to one serviced by firmware. After reset the firmware on the 
card is running out of ROM. The ROM based firmware 
provides very little functionality besides assisting in the 
write-able control store download.

The firmware that is to be downloaded to the INIC is 
compiled into the driver as a set of static character arrays. 
These static arrays are found in the file simbadownload.c, 
which is created by the objtoc utility. Objtoc is an Alacritech 
utility used to convert metastep assembly code output to 
static arrays, each array representing a single contiguous 
block of firmware. 

The download is performed by a series of three writes to 
the WCS register on the INIC. The first write is the address 
to be loaded, the second write is the bottom four bytes of the 
instruction, and the third write is the top four bytes of the
instruction. We simply walk through each static array down- 
loading the data contained in the array. 

Note that the top bits of the address written in the first 
write to the WCS carry special meaning. Bit 30 tells the 
firmware to compare the instruction contained in the next 
two writes to the instruction already contained in the speci- 
fied address. This is used to ensure that the download 
completed correctly. We first download all of the code, and 
then we download it all again with the bit 30 set in the 
address words. If the firmware discovers an error, it will 
place the address of the bad instruction into location zero of 
SRAM. After each "compare" sequence, the driver checks 
the location to determine if there was an error. If so, the 
driver fails the initialization. Bit 31 of the address word tells 
the firmware to jump to the specified address. We set this bit 
after the firmware has been successfully downloaded to start 
the normal INIC operation.
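The download, compare, and jump sequence can be sketched against a simulated device. The bit positions (30 and 31) and the three-write pattern come from the text. The hook structure, the fake device, the assumption that instruction addresses advance by one per instruction, and the assumption that a zero in SRAM location zero means "no error" are all illustrative.

```c
#include <stdint.h>
#include <stddef.h>

#define WCS_COMPARE (1u << 30)  /* compare next two writes with stored code */
#define WCS_JUMP    (1u << 31)  /* jump to the specified address */

typedef struct { uint32_t lo, hi; } wcs_insn;  /* bottom / top four bytes */

/* Hypothetical hardware hooks. */
typedef struct {
    void     (*write_wcs)(void *ctx, uint32_t value);
    uint32_t (*read_sram0)(void *ctx);  /* address of a bad instruction */
    void      *ctx;
} wcs_ops;

/* One pass over a contiguous block: three writes per instruction. */
static void wcs_pass(const wcs_ops *ops, uint32_t base,
                     const wcs_insn *code, size_t n, uint32_t flags)
{
    for (size_t i = 0; i < n; i++) {
        ops->write_wcs(ops->ctx, (base + (uint32_t)i) | flags);
        ops->write_wcs(ops->ctx, code[i].lo);
        ops->write_wcs(ops->ctx, code[i].hi);
    }
}

/* Download, verify, then start. Returns 0 on success, -1 on a mismatch. */
static int wcs_download(const wcs_ops *ops, uint32_t base,
                        const wcs_insn *code, size_t n, uint32_t start)
{
    wcs_pass(ops, base, code, n, 0);            /* load */
    wcs_pass(ops, base, code, n, WCS_COMPARE);  /* load again with bit 30 */
    if (ops->read_sram0(ops->ctx) != 0)
        return -1;                              /* fail the initialization */
    ops->write_wcs(ops->ctx, start | WCS_JUMP); /* bit 31: start firmware */
    return 0;
}

/* Simulated device used only to exercise the sketch. */
typedef struct { uint32_t writes, last, sram0; } fake_inic;

static void fake_write(void *ctx, uint32_t v)
{
    fake_inic *f = ctx;
    f->writes++;
    f->last = v;
}

static uint32_t fake_read_sram0(void *ctx)
{
    return ((fake_inic *)ctx)->sram0;
}
```

A driver holding several static arrays would call the download once per array and issue the jump only after the last one has verified; the sketch folds both into one call for brevity.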

The INIC contains a single structure representing the 
configuration of the card. This structure typically resides in
EEPROM or FLASH. The structure contains, among other
things, the DRAM size of the INIC, the SRAM size of the 
INIC, and the MAC addresses of the adapters contained on 
the INIC. 

This information is fetched from the INIC by issuing a 
Utility Processor Request (UPR) to the INIC firmware 
(UPRs are described below). The data returned by this UPR 
is contained within a shared memory structure pointed to by 
the SIMBA_CARD structure. 

Once the INIC has been initialized, we can initialize a 
particular adapter on the card. This is done as follows: 

At initialization time we queue the INIC with a set of 
header, data and response SDBs. We also pre-allocate a set 
of command SDB's and another set of free SDB's to avoid 
experiencing delays when we need to acquire more SDB's. 

In order to configure a particular MAC interface on the 
INIC, we must first obtain information about the state of the 
PHY. We do this by issuing a Read Link Status Register 
(RLSR) UPR to the INIC firmware. This command com- 
pletes asynchronously. When it completes, we save the 
information returned to us into the ADAPT structure. This 
information includes the link speed (10/100 mb), the 
"duplexivity" of the link (half/full), and the state of the link 
(up/down). With this information, we can configure the 
MAC configuration register, the MAC receive configuration 
register, and the MAC transmit configuration register. We 
also configure the MAC address registers with either the 
information returned from the INIC Configuration UPR 
described above, or, if the administrator has specified 
another MAC address, we use the specified address instead. 

For a device reset, NDIS defines two miniport driver
entry points used to identify and reset a dead adapter. The 
MiniportCheckForHang routine is called periodically to 
check on the health of the adapter. If the adapter is sick, the 
CheckForHang routine returns true. Returning true causes 
NDIS to send a reset status indication to the bound protocol 
drivers, and to also call the driver's MiniportReset routine. 
The MiniportReset routine is responsible for restoring the 
adapter to an operational state. If the MiniportReset routine 
returns good status, the adapter is assumed to be back online, 
and NDIS will send a reset-complete status indication 
upstream. 

SimbaCheckForHang is the MiniportCheckForHang rou- 
tine for the INIC driver. SimbaReset is the MiniportReset 
routine for the INIC driver. 

Some unique challenges are associated with a card reset 
for the present invention. First, as far as is known, we are the 
only multifunction network device in existence. There are 
other four-port cards, of course, but they are typically four 
individual devices behind a PCI-PCI bridge. Because our
four adapters are all associated with a single device, we 
cannot reset a single adapter. Yet, since the CheckForHang 
and Reset functions get called for each of the four adapters 
on the card, if we determine that the card needs to be reset 
then each of the four driver instances must recognize that a 
reset has taken place and perform initialization. This poses
a synchronization problem. The card, of course, must only 
be reset once. After reset, the card must be initialized once 
and all four adapters must be initialized individually. To 
ensure that all four instances of the driver recognize that a 
reset has occurred, and to perform re-initialization, we set 
the adapter state in the ADAPT structure to ADAPT_ 
RESET for each interface on the card. When the Check- 
ForHang function is called for each interface, it will check 
the adapter state to see if a reset has occurred. If the adapter 
state is set to ADAPT_RESET, it returns true.

A second challenge is core dumps. Most NICs have little 
or no software or firmware running on the card. We have a
substantial amount of firmware, and like any other code, it
is subject to bugs. When the card becomes non-operational,
there is a good chance that it is the result of a firmware bug.
We have interactive debuggers that can be used internally to
diagnose the cause of an INIC crash, but there may be times
when it is essential that we be able to dump the state of the 
card after it has crashed. 

In this situation, the contents of the INIC registers, 
SRAM, DRAM, and possibly some queues will be dumped 
from the card. Since this amounts to many megabytes worth 
of data, we will need to move the data in blocks from the 
INIC to a file. NT allows a driver to create and write to a file 
from the kernel, but it must be done at passive level. Both 
the CheckForHang and the Reset routines run at dispatch 
level. 

To get around this problem, we have introduced a "dump 
thread". The dump thread is a kernel thread that is started 
whenever a card structure is allocated (i.e. one thread per 
card). In the INIC driver, it is actually the dump thread, not
the CheckForHang routine, that monitors the state of the
card. We have the dump thread do this because we want to
be able to dump the contents of the card before a reset is
initiated. The dump thread, upon finding a card dead, will
attempt to perform the dump, and then it will set the card
state to CARD_DOWN and set each adapter state to
ADAPT_RESET. When the CheckForHang routine for
each adapter finds the adapter state set to ADAPT_RESET,
it will return true as described above, to begin the
re-initialization process.
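The interplay between the dump thread and the per-adapter CheckForHang and Reset routines can be illustrated with a small state sketch. The state names CARD_DOWN and ADAPT_RESET come from the text; the remaining names, the reset counter, and the absence of locking are simplifying assumptions.

```c
#include <stdbool.h>

#define ADAPTERS_PER_CARD 4

typedef enum { ADAPT_UP, ADAPT_RESET } adapt_state;
typedef enum { CARD_UP, CARD_DOWN } card_state;

typedef struct {
    card_state  state;
    adapt_state adapter[ADAPTERS_PER_CARD];
    int         resets;  /* card resets performed, for illustration */
} card;

/* Dump thread: called after it has found the card dead and dumped it. */
static void dump_thread_mark_dead(card *c)
{
    c->state = CARD_DOWN;
    for (int i = 0; i < ADAPTERS_PER_CARD; i++)
        c->adapter[i] = ADAPT_RESET;
}

/* Per-adapter CheckForHang: returning true makes NDIS call the reset. */
static bool check_for_hang(const card *c, int idx)
{
    return c->adapter[idx] == ADAPT_RESET;
}

/* Per-adapter reset: only the first caller resets the shared card;
 * every caller re-initializes its own adapter. */
static void reset_adapter(card *c, int idx)
{
    if (c->state == CARD_DOWN) {
        c->resets++;          /* stand-in for card reset and download */
        c->state = CARD_UP;
    }
    c->adapter[idx] = ADAPT_UP;
}
```

Marking every adapter at once is what lets four independent driver instances converge on a single card reset.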

The MiniportShutdown routine for the INIC driver is
SimbaShutdown. It is called at system shutdown
time so that we can put the INIC into a known state. We 
simply issue a reset to the INIC when the shutdown routine 
is called. 

SimbaHalt is the miniport halt routine for the INIC driver. 
It is called when the driver is halted. It is responsible for freeing
all of the resources associated with the particular adapter 
that is being halted. A trick with this routine is to keep track 
of which adapters on a particular INIC card have been 
halted. The last adapter to be halted must also free resources 
allocated for the INIC card (the SIMBA_CARD structure 
and the shared memory used to contain the INIC 
configuration). We keep track of which adapters have been
halted in the SIMBA_CARD structure. 

SimbaQueryInformation is the MiniportQueryInformation
routine for the INIC driver. SimbaSetInformation is the
MiniportSetInformation routine for the INIC driver.

At present we support all of the required OIDs in the INIC 
driver. We have also added the following custom OIDs:

OID_SIMBA_ADD_IPADDR— Sent down from the 
ATCP driver to register an IP address with the INIC driver. 
The INIC driver uses these addresses to determine which
way to direct incoming traffic. This is discussed further 
below. 

OID_SIMBA_REMOVE_IPADDR— Used to remove 
an IP address added with OID_SIMBA_ADD_IPADDR.

OID_SIMBA_GET_STATS— A query from the ATCP 
driver to obtain statistics maintained on the INIC adapter. 
This is discussed further below. 

OID_SIMBA_ATK_GLOBALS— Sent down from the 
ATCP driver to pass shared memory information. This is 
primarily used for tracing. This too is discussed further 
below. 

For message transmission, we label the MiniportSend- 
Packets routine for the INIC driver SimbaSendPackets. It is 
called with an array of NDIS packet descriptors, which have 
been passed down to us by an upper level protocol driver 
(ATCP, MS TCP, IPX, etc).

For each packet contained in the array of packet descriptors
we perform the steps described below.

First, we check for errors. We may fail to send a packet for
any of the following reasons:

Microcode diagnostics are enabled — We provide a set of 
diagnostics that allow us to exercise the INIC microcode in
a controlled loop-back environment. If these diagnostics are 
enabled, then we do not allow any of the standard protocol 
routines to send data. We fail immediately by setting the 
packet status to NDIS_STATUS_FAILURE. 

Link or Adapter State is not up — If the Link State or the 
Adapter State is down, we cannot send any packets. We fail 
immediately by setting the packet status to NDIS_ 
STATUS_FAILURE. 

Zero-length packet — Strictly speaking, this is not an error. 
If we encounter a zero length packet we complete it suc- 
cessfully immediately. 

Insufficient map registers — We need a map register for 
every NDIS buffer in the packet. If we do not have enough 
map registers, then we cannot send the packet. We fail
immediately by setting the packet status to
NDIS_STATUS_RESOURCES.

No command buffer — If we need a command buffer and
cannot allocate one we fail immediately by setting the
packet status to NDIS_STATUS_RESOURCES.
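The checks listed above might be arranged as a single pre-send validation routine. The sketch below uses a local enum in place of the real NDIS status codes, and the structure fields and the order of the tests (taken from the order of the list) are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the outcomes described in the text. */
typedef enum {
    SEND_OK_TO_TRANSMIT,    /* all checks passed */
    SEND_COMPLETE_SUCCESS,  /* zero-length packet: complete immediately */
    SEND_FAILURE,           /* corresponds to NDIS_STATUS_FAILURE */
    SEND_RESOURCES          /* corresponds to NDIS_STATUS_RESOURCES */
} send_check;

typedef struct {
    bool   diagnostics_enabled;
    bool   link_up;
    bool   adapter_up;
    size_t free_map_registers;
    size_t free_command_buffers;
} adapter_status;

typedef struct {
    size_t length;                   /* total packet length in bytes */
    size_t ndis_buffer_count;        /* one map register needed per buffer */
    bool   has_atcp_command_buffer;  /* ATCP supplied its own command buffer */
} packet_info;

static send_check check_send(const adapter_status *a, const packet_info *p)
{
    if (a->diagnostics_enabled)
        return SEND_FAILURE;
    if (!a->link_up || !a->adapter_up)
        return SEND_FAILURE;
    if (p->length == 0)
        return SEND_COMPLETE_SUCCESS;
    if (p->ndis_buffer_count > a->free_map_registers)
        return SEND_RESOURCES;
    if (!p->has_atcp_command_buffer && a->free_command_buffers == 0)
        return SEND_RESOURCES;
    return SEND_OK_TO_TRANSMIT;
}
```

Keeping the checks free of side effects means a rejected packet leaves the adapter's resource counts untouched.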

The code that interacts with the INIC hardware is separated
from the code that interacts with NDIS. The code that
interacts with the INIC hardware is contained in the
SimbaTransmitPacket routine.
We separate it in this manner so that the microcode diag- 
nostics (which run outside the context of NDIS), can share 
the same transmit code as the normal path. 

Command buffers contain many different types of information.
For slow-path frames, for example, command buffers
contain information about the address and length of the
frame to be sent. Command buffers may also be used to hand
a context from the ATCP driver out to the INIC, or to force
a flush of a context from the INIC. For these and other
purposes the ATCP driver needs to be able to set the
command buffer up itself. Thus command buffers may be
allocated in two areas. Any calls to SimbaSendPackets from
the ATCP driver contain an ATCP allocated command buffer.
Any calls from other drivers contain
raw data (a network frame). For the calls that already contain
a command buffer, we must simply pass the command buffer
off to the INIC. For other calls, we must allocate our own
command buffer and configure it appropriately.

It is thus important to identify whether a packet is
a normal frame or, as shown in FIG. 15, the packet 360
contains an ATCP command buffer 362. Our solution to this
is that in the ATCP driver we prepend an ethernet header 366 to
the command buffer. This ethernet header is located in a
separate chunk of memory (with a separate NDIS buffer
descriptor) and contains an ethernet type field of 0x666. This
value was chosen not only because of its spooky nature, but
also because it is too large to be an 802.3 length, and too
small to be a valid ethernet type field. It is a value that we
never expect to see handed to us in a frame from MS TCP,
IPX, Appletalk, or any other protocol driver.
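The test for an ATCP command buffer then reduces to inspecting the type field of the first buffer in the packet. The sketch below assumes the prepended header uses the standard 14-byte ethernet layout with the type field stored in network (big-endian) byte order; the function and macro names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define ETH_HEADER_LEN   14u     /* 6 destination + 6 source + 2 type */
#define ETH_TYPE_OFFSET  12u
#define ATCP_CMD_ETHTYPE 0x0666u /* 1638, above the 1500-byte 802.3 maximum */

/* Returns true if the first buffer of a packet is the marker header that
 * the ATCP driver prepends to a command buffer. */
static bool is_atcp_command(const uint8_t *first_buf, size_t len)
{
    uint16_t type;
    if (first_buf == NULL || len < ETH_HEADER_LEN)
        return false;
    type = (uint16_t)((first_buf[ETH_TYPE_OFFSET] << 8) |
                      first_buf[ETH_TYPE_OFFSET + 1]);
    return type == ATCP_CMD_ETHTYPE;
}
```

Anything that fails this test is treated as a normal frame, for which the INIC driver allocates its own command buffer.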

Sending command buffers that have been given to the 
INIC driver by the ATCP driver is relatively simple. The 
INIC driver maps the command buffer to obtain the com- 
mand buffer's physical address, flushes the command buffer 
and hands it to the appropriate transmit queue on the INIC.

Several types of commands may be sent to the receive
processor of the INIC instead of the transmit processor.
These commands include the release context command
(IHCMD_RLS_CTXT) and the receive MDL command
(IHCMD_RCV_MDL). The INIC driver examines the
command buffer and if the command is one of these types,
hands the command buffer to the receive processor.

Note that the INIC driver does not reference any fields in 
the command buffer after it has been flushed. 

As illustrated in FIG. 16, normal frames can contain any 
number of buffers with many different types of data such as 
buffer 1 370 and buffer 2 371. When the INIC driver receives 
a normal frame it first allocates and initializes a command 
buffer 373 of its own. The INIC driver obtains this from the 
CmdQueue in the ADAPT structure. It then maps every
buffer descriptor attached to the packet to obtain the physical
address of the buffer, and then fills in the command buffer
with these physical addresses, e.g. buffer descriptors 374 and
375 for the frames shown in FIG. 16.

The INIC driver also flushes each buffer associated with
the packet to maintain cache coherency. After we have filled 
the command buffer in with the complete list of buffers, we 
must then map and flush the command buffer itself and hand 
the physical address of the command buffer off to the INIC. 

After we have sent all of the packets in the packet array, 
we check to see if we have exhausted any of the command 
SDBs. If so we attempt to allocate replacement SDBs and 
requeue them to the CmdSDB queue. 

After a command issued to the INIC has completed, the 
resources held by the command must be freed and the 
corresponding send, which initiated the command, must be 
completed. This is performed in the SimbaXmtDpc routine. 

As described above, commands are completed by the 
INIC by filling in a response buffer. The reason that we do 
this instead of simply setting a completion flag in the 
command buffer is because commands can be completed out 
of order. Some commands, like one to transmit 64 k of SMB 
data, will take much longer than a command to transmit 100 
bytes of data. 

The command buffer contains a "HostHandle" field,
which is filled in with the virtual address of the command
buffer. When a command completes, the INIC firmware puts
this address into the response buffer.

Response buffers are returned to us in order, so the first 
thing that the SimbaXmtDpc routine does is to locate the 
next expected response buffer. If the status indicates that it 
has been filled in by the INIC, we locate the completed 
command buffer from the HostHandle field. 

At the end of the command buffer, we keep a bunch of 
information that is not passed to the INIC. Among this 
information is the list of map registers used to obtain 
physical buffer addresses. We use this list to free the map 
registers. 

In the Transmit description above two types of sends are 
mentioned, one in which the ATCP driver allocates a com- 
mand buffer, and another in which the INIC driver allocates 
a command buffer. Clearly, if the INIC driver allocated the 
command buffer, the INIC driver must also free it, yet if the 
ATCP driver allocated it, the INIC driver must not. We 
determine this by saving a pointer to the SDBHANDLE at 
the end of the command buffer. If it is an ATCP driver 
allocated command buffer, there will be no SDBHANDLE 
set in the psdbh field of the command buffer. 
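The completion path described above (recover the command buffer from the HostHandle, then free it only if the INIC driver allocated it) can be sketched as follows. The `psdbh` and HostHandle names come from the text; the structure layouts, the `filled` and `in_use` flags, and the function names are assumptions, and the freeing of map registers is omitted.

```c
#include <stddef.h>
#include <stdbool.h>

struct sdb_handle { int placeholder; };  /* stand-in for SDBHANDLE */

typedef struct command_buffer {
    void              *host_handle;  /* virtual address of this buffer */
    /* Host-only trailer, not passed to the INIC: */
    struct sdb_handle *psdbh;        /* NULL if the ATCP driver allocated it */
    bool               in_use;
} command_buffer;

typedef struct {
    void *host_handle;  /* copied from the command by the INIC firmware */
    int   filled;       /* nonzero once the INIC has written the response */
} response_buffer;

/* Returns the completed command, or NULL if the next expected response
 * buffer has not been filled in yet. */
static command_buffer *complete_response(response_buffer *rsp)
{
    command_buffer *cmd;
    if (!rsp->filled)
        return NULL;
    cmd = rsp->host_handle;
    rsp->filled = 0;
    return cmd;
}

/* Frees the command buffer only if the INIC driver owns it.
 * Returns true if it was freed here. */
static bool release_command(command_buffer *cmd)
{
    if (cmd->psdbh == NULL)
        return false;     /* ATCP-allocated: the ATCP driver frees it */
    cmd->in_use = false;  /* stand-in for returning it to CmdQueue */
    return true;
}
```

Using the presence of the SDBHANDLE pointer as the ownership flag avoids a separate field that the two drivers would have to keep in sync.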

NDIS defines two routines used in interrupt handling. The 
first is the MiniportISR routine. It is called at interrupt level
and its purpose is to determine if the interrupt is associated
with its device and if so, mask the interrupt and tell NDIS
to schedule the MiniportHandleInterrupt routine. The
MiniportHandleInterrupt routine runs at DPC level and
performs the bulk of the interrupt processing.

SimbaISR is the MiniportISR routine for the INIC driver.
SimbaHandleInterrupt is the MiniportHandleInterrupt
routine for the INIC driver.

Note that most PCI devices maintain an interrupt status 
register on the card. When an interrupt occurs, the driver 
must read the value of the ISR from PCI. Since reading data 
from the PCI bus is an expensive operation, we sought to 
optimize this by putting interrupt status in a host-memory 
based interrupt status "register". This memory-based ISR is 
contained in the non-cached shared memory region allocated 
per interface. 

There are some concerns however when using a memory- 
based ISR. Race conditions can occur when the host driver 
is clearing status, while the INIC card is writing status. To 
keep this from happening, we have introduced a strict 
protocol. The INIC is not allowed to write status to the
memory-based ISR until the driver has responded to previ- 
ously written status. 

The SimbaISR routine first checks its memory-based ISR
to see if any events are set. If not it tells NDIS that it does 
not own the interrupt. Otherwise, it sets the contents of the 
memory-based ISR aside, zeros the memory-based ISR and 
masks interrupts from the INIC. Note that even though we 
have masked interrupts, our ISR routine may get called 
again as the result of an interrupt being generated by another 
device that shares the same interrupt line. For this reason, we 
zero the memory-based ISR to prevent us from getting 
confused. 

The SimbaHandlelnterrupt routine checks every possible 
bit of the interrupt status that we set aside in SimbalSR, and 
responds appropriately. This includes calling the Sim- 
baRcvDpc routine in the case of a receive event, SimbaXmt- 
Dpc in the case of a command completion event, etc. 

After all of the events have been processed, we clear the 
events on the INIC by writing to the interrupt status response 
register. This will clear the way for the INIC to send us new 
status. We then also unmask the interrupts. Note that we 
must not unmask the interrupts until we write to the interrupt 
status response register. Otherwise, the INIC will give us an 
interrupt for the events that it has already sent to us. 
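
The handshake around the memory-based ISR may be easier to follow as a small model. The sketch below is not the driver source: the ADAPTER layout, the bit values and the function names simba_isr and simba_handle_interrupt are assumptions, and the hardware mask and response registers are modeled as plain flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ISR_RCV 0x1u   /* bit values are assumptions */
#define ISR_CMD 0x2u

typedef struct adapter {
    volatile uint32_t mem_isr;   /* host-memory "register" written by the INIC */
    uint32_t saved_isr;          /* status set aside for the deferred routine */
    bool     masked;             /* models the INIC interrupt mask */
    bool     response_written;   /* models the interrupt status response register */
    unsigned rcv_events, cmd_events;
} ADAPTER;

/* Interrupt-level routine: claim the interrupt only if status is present. */
static bool simba_isr(ADAPTER *a)
{
    assert(a != NULL);
    uint32_t status = a->mem_isr;
    if (status == 0)
        return false;            /* not ours, e.g. a shared interrupt line */
    a->saved_isr = status;       /* set the status aside */
    a->mem_isr = 0;              /* zero it so a repeated call sees nothing */
    a->masked = true;
    a->response_written = false;
    return true;                 /* ask for the deferred routine to run */
}

/* Deferred routine: handle each event, then respond, then unmask. */
static void simba_handle_interrupt(ADAPTER *a)
{
    assert(a != NULL);
    if (a->saved_isr & ISR_RCV) a->rcv_events++;
    if (a->saved_isr & ISR_CMD) a->cmd_events++;
    a->saved_isr = 0;
    a->response_written = true;  /* the INIC may now post new status */
    a->masked = false;           /* unmask only after the response */
}
```

The ordering in the deferred routine mirrors the rule stated above: the response is written first, and only then are interrupts unmasked.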

Receive data is passed from the INIC to the host by filling
in a header buffer. The header buffer contains information
about the data, such as the length. If the data is small enough,
the header buffer also contains the data itself. Otherwise, the
data is contained in a corresponding data buffer. If the data
resides in a data buffer, the header buffer will contain a
pointer to the SDBHANDLE structure associated with the
data buffer. Furthermore, the offset of the buffer within the
SDB is placed in the bottom bits of the pointer to the
SDBHANDLE structure.

FIG. 17 shows an example of a receive header 400 and 
data buffer 402. In this example, the buffer field 404 of the 
header buffer 406 contains the address of the data SDB- 
HANDLE 408 structure (0x1000) with the bottom bit set to 
indicate that the data buffer is at offset 1 within the two part 
data SDB. 
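
The FIG. 17 example can be worked through with a short sketch of the low-bit encoding. The mask width and the helper names are assumptions made for illustration; the only values taken from the text are the handle address 0x1000 and the offset 1.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: SDBHANDLE structures are aligned so that the low bits of
 * their address are free; the two-bit mask here is illustrative. */
#define SDB_OFFSET_MASK ((uintptr_t)0x3)

/* Combine a handle address and a buffer offset into one buffer field. */
static uintptr_t pack_buffer_field(uintptr_t handle_addr, unsigned offset)
{
    assert((handle_addr & SDB_OFFSET_MASK) == 0);
    assert(offset <= SDB_OFFSET_MASK);
    return handle_addr | offset;
}

/* Recover the SDBHANDLE address by clearing the offset bits. */
static uintptr_t sdb_handle_of(uintptr_t field)
{
    return field & ~SDB_OFFSET_MASK;
}

/* Recover the buffer offset within the SDB from the low bits. */
static unsigned sdb_offset_of(uintptr_t field)
{
    return (unsigned)(field & SDB_OFFSET_MASK);
}
```

With these helpers, a handle at 0x1000 and offset 1 yield a buffer field of 0x1001, from which both parts can be recovered without any extra storage in the header buffer.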

After the INIC fills in the header, and possibly data buffer
411, it notifies the host by setting the ISR_RCV bit in the
memory-based ISR and raises an interrupt. The
SimbaHandleInterrupt routine in the driver calls the SimbaRcvDpc
routine to process the received frames.

The INIC driver maintains a queue of header SDBs, each
of which contains 16 header buffers. The head of the
HdrQueue is the current SDB being worked on, and the
SDBHANDLE structure for that SDB contains the offset of the
next expected header buffer within the SDB (header buffers
are returned to the driver in the order that they were
presented to the INIC). Each valid header buffer found by
the INIC driver is dequeued and processed. A buffer is
dequeued by bumping the offset within the SDBHANDLE
structure. Once all of the header buffers within an SDB are
exhausted, we dequeue the SDB from the HdrQueue and
start over again at the top of the next SDB. A header buffer
is determined to be valid by checking the status field. The
status field is cleared when the header buffers are passed to
the INIC. The INIC sets the valid bit of the status field when a
buffer is returned to the INIC driver.
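
A minimal sketch of this dequeue logic follows. It assumes a simplified layout in which the next-buffer offset and the queue link live directly in the SDB structure; the type names, the HDR_VALID bit position and the function next_header are hypothetical. Only the count of 16 header buffers per SDB comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDRS_PER_SDB 16          /* from the text */
#define HDR_VALID    0x1u        /* bit position is an assumption */

typedef struct hdr_buf { uint32_t status; } HDR_BUF;

typedef struct hdr_sdb {
    HDR_BUF         hdr[HDRS_PER_SDB];
    unsigned        next;        /* offset of the next expected header buffer */
    struct hdr_sdb *link;        /* next SDB in the HdrQueue */
} HDR_SDB;

/* Returns the next valid header buffer, or NULL if the INIC has not
 * returned one yet. *head is advanced when an SDB is exhausted. */
static HDR_BUF *next_header(HDR_SDB **head)
{
    assert(head != NULL);
    HDR_SDB *sdb = *head;
    if (sdb == NULL)
        return NULL;
    assert(sdb->next < HDRS_PER_SDB);
    HDR_BUF *h = &sdb->hdr[sdb->next];
    if (!(h->status & HDR_VALID))
        return NULL;             /* the INIC has not filled this one in */
    if (++sdb->next == HDRS_PER_SDB)
        *head = sdb->link;       /* SDB exhausted: move to the next SDB */
    return h;
}
```

Because buffers come back in the order they were presented, a single offset per SDB is enough; no search over the 16 buffers is needed.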

If an error exists in the frame, the INIC sets the 
IRHDDR_ERR bit in the status word, and forwards the 
receive frame status words to the host. These status words 
are generated by the INIC hardware and placed in front of 
the receive frame. For more details regarding these status 
words, refer to the sections regarding the INIC hardware 
specification. 

In the event of an error, the SimbaRcvDpc routine incre- 
ments the appropriate statistics field in the ADAPT structure, 
and then drops the received frame. 

If the INIC driver receives a normal network frame, it 
needs to ensure that it is configured to receive the frame. We 
do this by calling the SimbaMacFilter routine. If we are 
running in promiscuous mode, then this routine always 
returns true. If the destination MAC address equals our 
MAC address and we are configured for directed mode, then 
we also return true. Or, if the destination MAC address is a
broadcast address and we are configured to receive
broadcast packets, then we return true as well.

Multicast frames are a bit trickier. When the INIC driver
receives an OID_802_3_MULTICAST OID in the
SimbaSetInformation routine, it downloads a 6-bit hash of the
multicast address to the INIC firmware. This 6-bit hash is
generated by computing the 8-bit CRC polynomial generated
by the MAC core and masking off the top two bits.
When the firmware hands a multicast frame to us, we must
ensure that we are configured to receive the multicast frame
by checking for a perfect match against our list of multicast
addresses. If a match occurs, and we are configured to receive
multicast frames, then the SimbaMacFilter routine returns
true.
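
The filter decisions described above can be summarized in one function. The sketch below is written under assumptions: the flag values, the FILTER_CFG layout, the fixed-size multicast list and the name mac_filter are illustrative and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAC_LEN 6
#define FILTER_DIRECTED    0x1u   /* flag values are assumptions */
#define FILTER_BROADCAST   0x2u
#define FILTER_MULTICAST   0x4u
#define FILTER_PROMISCUOUS 0x8u
#define MAX_MCAST 8               /* list size is an assumption */

typedef struct filter_cfg {
    unsigned      flags;
    unsigned char mac[MAC_LEN];
    unsigned char mcast[MAX_MCAST][MAC_LEN];
    size_t        mcast_count;
} FILTER_CFG;

/* Returns true if a frame with destination address dst should be accepted. */
static bool mac_filter(const FILTER_CFG *cfg, const unsigned char *dst)
{
    static const unsigned char bcast[MAC_LEN] =
        {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
    assert(cfg != NULL && dst != NULL && cfg->mcast_count <= MAX_MCAST);

    if (cfg->flags & FILTER_PROMISCUOUS)
        return true;
    if ((cfg->flags & FILTER_DIRECTED) && memcmp(dst, cfg->mac, MAC_LEN) == 0)
        return true;
    if ((cfg->flags & FILTER_BROADCAST) && memcmp(dst, bcast, MAC_LEN) == 0)
        return true;
    if (cfg->flags & FILTER_MULTICAST) {
        /* A 6-bit hash can alias, so a perfect address match is required. */
        for (size_t i = 0; i < cfg->mcast_count; i++)
            if (memcmp(dst, cfg->mcast[i], MAC_LEN) == 0)
                return true;
    }
    return false;
}
```

The perfect-match loop is needed because many multicast addresses share the same 6-bit hash, so the firmware may hand up frames for groups the host never joined.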

There are three types of received frames that we must
handle in the SimbaRcvDpc routine: 1) fast path frames (or
messages), 2) slow path TCP frames, and 3) other frames.

Fast path frames are identified by the IRHDDR_TVALID
bit in the status word. It means that the header buffer (and
possibly data buffer as well) contains a frame or message
associated with a fast path connection on the INIC. Under
these circumstances we must send the received frame strictly
to the ATCP driver.

If the IRHDDR_TVALID bit is not set, then the header
buffer, or associated data buffer, contains a normal network
frame. If the network frame is a TCP frame for one of the
network interfaces in our system, then the INIC driver needs
to send the frame up to the ATCP driver. This is a slow path
TCP frame. Otherwise the INIC driver needs to send it up to
the Microsoft TCP driver. Note that we only send the frame
up to the ATCP driver if it is a TCP frame that is destined for
one of our interfaces. We must check the destination IP
address because if it is not destined for one of our interfaces,
then the frame needs to be routed. Frames that need to be
routed are done so via the normal Microsoft TCP stack. Note
also that we forward the frame up to the ATCP driver if the
frame is destined for any interface in our system, not just the
INIC interfaces. This is because if the frame came in on our
interface, it is likely to go out on our interface. Under these
circumstances, we must handle it in the ATCP driver.

Frames that are sent from the INIC driver to the ATCP
driver are done so by calling SimbaIndicateHeader. Frames
that are sent up to the normal TCP driver are done so by
calling SimbaIndicateData.
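
The three-way decision made in SimbaRcvDpc might be sketched as follows. The bit value of IRHDDR_TVALID, the FRAME_INFO structure and the route names are assumptions introduced for illustration; the local-address check is reduced to a precomputed flag, and non-IP frames are assumed to carry a protocol value other than TCP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IRHDDR_TVALID   0x1u  /* bit value is an assumption */
#define IPPROTO_TCP_NUM 6     /* IP protocol number for TCP */

typedef enum { TO_ATCP_FAST, TO_ATCP_SLOW, TO_OTHER_STACK } RCV_ROUTE;

typedef struct frame_info {
    uint32_t status;          /* status word from the header buffer */
    uint8_t  ip_protocol;     /* protocol field of the IP header */
    bool     dst_ip_is_local; /* destination IP belongs to any local interface */
} FRAME_INFO;

/* Decide which driver above us should see the received frame. */
static RCV_ROUTE classify(const FRAME_INFO *f)
{
    assert(f != NULL);
    if (f->status & IRHDDR_TVALID)
        return TO_ATCP_FAST;      /* fast path frame or message */
    if (f->ip_protocol == IPPROTO_TCP_NUM && f->dst_ip_is_local)
        return TO_ATCP_SLOW;      /* slow path TCP frame */
    return TO_OTHER_STACK;        /* routed or non-TCP traffic */
}
```

In this model the first two results would be indicated with SimbaIndicateHeader and the third with SimbaIndicateData.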

When we wish to send a frame up to the ATCP driver via
SimbaIndicateHeader, we do so by sending up the entire
header buffer. We do this because the header buffer may
contain information that is important to the ATCP driver. In
order to send the header buffer exclusively to the ATCP
driver we have to do two things.

First, in order to prevent the normal TCP driver (or any
other driver) from attempting to parse the frame, we must
make the frame look like something that it doesn't want to
touch. Remember that the drivers above an ethernet miniport
driver expect an indicated frame to begin with an ethernet
header, and thus expect an ethernet type field to be at a
twelve byte offset within the frame. We trick the other
protocol drivers by setting this "type" field to 0x666 (the
same value used to identify an ATCP command buffer in the
transmit path).

The second thing that we must do is to get the frame past
the NDIS filtering. NDIS performs ethernet frame filtering
for us. If we send up a frame that does not have the
destination MAC address field set to our interface's MAC
address, NDIS will drop it. There are two ways to deal with
this. The first is to set the NDIS filtering to promiscuous
mode. This way all frames are received by the protocol
driver. This is undesirable because NDIS will then forward
all outgoing frames back up to the ATCP driver. The other
way is to set the first 6 bytes of the header buffer (the
destination MAC address) to our interface's MAC address.
While this does require a 6-byte copy for every frame
received, this was determined to be the best approach.
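
Both adjustments to the header buffer amount to a few byte stores. The sketch below assumes that the type value is written in network (big-endian) byte order, which the text does not state; the function name prepare_header_for_atcp is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAC_LEN          6
#define ETH_TYPE_OFFSET  12      /* type field offset, from the text */
#define ATCP_ETH_TYPE    0x666   /* from the text */

/* Make a header buffer acceptable to NDIS filtering and unattractive to
 * every protocol driver other than the ATCP driver. */
static void prepare_header_for_atcp(unsigned char *hdr, size_t len,
                                    const unsigned char *our_mac)
{
    assert(hdr != NULL && our_mac != NULL && len >= ETH_TYPE_OFFSET + 2);
    memcpy(hdr, our_mac, MAC_LEN);   /* the 6-byte copy mentioned above */
    /* Assumption: the type is stored in network (big-endian) byte order. */
    hdr[ETH_TYPE_OFFSET]     = (ATCP_ETH_TYPE >> 8) & 0xff;
    hdr[ETH_TYPE_OFFSET + 1] = ATCP_ETH_TYPE & 0xff;
}
```

The source address bytes between the two fields are left alone, since neither the NDIS filter nor the type check depends on them.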

In order to indicate the header buffer, and possibly data
buffer, upstream, we first map the buffers using NDIS buffer
and packet descriptors. Recall that for each header buffer we
pre-allocate an NDIS buffer and packet descriptor, while for
each data buffer we pre-allocate just a buffer descriptor. We
use these pre-allocated buffer and packet descriptors here to
map the buffer(s) and send them upstream. FIG. 18
illustrates the relationship between all of these structures.

When indicating data we only want to send up the data
portion of the frame alone. Recall that data can either reside
in the header buffer itself, if it is small enough, or in an
associated data buffer. If the data resides in the header buffer,
then we adjust the buffer descriptor such that it points to the
data portion of the header buffer (beneath the status word,
etc.). Conversely, if the data resides in the data buffer, we use
the buffer descriptor associated with the data buffer to point
to the data buffer, and we use the packet descriptor
associated with the header buffer to point to the data buffer
descriptor. After setting everything up, we then free the
header buffer, and the buffer descriptor associated with it.

Once we have completed processing incoming data, we
replace any completed header and data SDBs by issuing new
SDBs to the INIC. Note that we do this immediately, rather
than waiting for the ATCP or other protocol driver to return
the buffers to us.

In NDIS version 4, there are two ways in which a miniport
driver can indicate data to a protocol driver above it. The
first method is performed by calling NdisMEthIndicateReceive.
With this method, the data passed up is copied
immediately into memory allocated by the protocol driver.
Thus, when the call is complete, the memory used to contain
the data can be freed. While this is simple from a resource
management perspective, it is horribly inefficient. The second
method is performed by calling NdisMIndicateReceivePacket.
With this method, the packet passed up is held by the
protocol driver until it has completed processing the entire
data. With this method, we need a way of returning the
completed packets back to the miniport driver so the
miniport driver can free the memory. This is done via a call
to NdisReturnPackets, which results in a call to the
MiniportReturnPacket handler.

SimbaReturnPacket is the MiniportReturnPacket handler
for the INIC driver. Note that the packet being returned to us
contains a header and/or a data buffer. As we described
above, in order to free a header buffer or data buffer, we must
have a pointer to the SDBHANDLE structure, and we must
also know the buffer offset within the SDB. Both of these
values are saved in the ProtocolReserved field of the packet
descriptor. The ProtocolReserved field is a section of
memory within the packet descriptor that is reserved for use
by the miniport driver.

To send and receive frames (and commands) from the 
INIC we use the mechanism described above regarding the 
host interface strategy for the Alacritech INIC. Beyond this, 
however, we also need a mechanism to receive other mis- 
cellaneous types of information from the INIC. This infor- 
mation includes statistics, link status (discussed above), and 
INIC configuration information. To perform this function, 
we use a set of commands called Utility Processor Requests 
(UPRs). 

UPRs are handled exclusively by the utility processor on 
the INIC. Each UPR command is initiated by writing to the 
corresponding register on the INIC. The address written to 
the register tells the INIC where to place the data. For 
example, if we wish to fetch the INIC configuration from the 
INIC, we write the address of the INIC configuration shared 
memory space to the RCONFIG register of the INIC. 

UPRs complete asynchronously by setting a bit in the
ISR and raising an interrupt. Because there is no
identification as to which UPR has completed, we only keep one
UPR outstanding per interface at any given time. If a UPR
is already in progress, then a subsequent UPR will be queued
behind it. When the pending UPR completes, the queued
UPR will be issued.
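
The one-outstanding-UPR rule can be modeled with a small FIFO. Everything named below is an assumption made for the sketch: the request types, the queue depth of 8, the UPR_STATE layout and the functions upr_request and upr_complete. The register write that starts a UPR is represented by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define UPR_QUEUE_MAX 8          /* queue depth is an assumption */

typedef enum { UPR_STATS, UPR_CONFIG, UPR_LINK } UPR_TYPE;  /* illustrative */

typedef struct upr_state {
    bool     busy;                      /* a UPR is outstanding on the INIC */
    UPR_TYPE current;                   /* the outstanding request */
    UPR_TYPE pending[UPR_QUEUE_MAX];    /* FIFO of requests waiting their turn */
    size_t   head, count;
    unsigned issued;                    /* number of modeled register writes */
} UPR_STATE;

static void upr_issue(UPR_STATE *s, UPR_TYPE t)
{
    s->busy = true;
    s->current = t;
    s->issued++;                        /* stands in for the register write */
}

/* Request a UPR; returns false only if the queue is full. */
static bool upr_request(UPR_STATE *s, UPR_TYPE t)
{
    assert(s != NULL);
    if (!s->busy) { upr_issue(s, t); return true; }
    if (s->count == UPR_QUEUE_MAX) return false;
    s->pending[(s->head + s->count++) % UPR_QUEUE_MAX] = t;
    return true;
}

/* Called on the UPR completion event; returns the type that completed. */
static UPR_TYPE upr_complete(UPR_STATE *s)
{
    assert(s != NULL && s->busy);
    UPR_TYPE done = s->current;
    s->busy = false;
    if (s->count > 0) {
        UPR_TYPE next = s->pending[s->head];
        s->head = (s->head + 1) % UPR_QUEUE_MAX;
        s->count--;
        upr_issue(s, next);             /* issue the queued UPR */
    }
    return done;
}
```

Since only one request is ever outstanding, the completion event needs no identifier: the driver simply remembers which UPR it issued last.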

NT requires that an NDIS driver provide the following 
statistics: successful transmits, successful receives, transmit 
errors, receive errors, dropped receives (no buffer), and 
transmit collisions. 

The majority of these statistics are maintained on the
INIC. When the INIC driver receives a QueryInformation
call for one of these statistics, we issue a stats UPR
command to the INIC and return pending. When the UPR
completes we in turn complete the pending QueryInformation
call with the requested information.

The Microsoft stack maintains a number of statistics 
about each interface such as multicast receives, broadcast 
receives, unicast receives, multicast transmits, etc. It also 
maintains TCP level statistics such as the number of seg- 
ments sent and received, and the number of TCP bytes sent 
and received. Since the INIC offloads the TCP stack from the 
NT system, we cannot maintain these statistics in the ATCP
driver. Instead, we maintain most of these statistics on the 
INIC itself. When the ATCP driver requires these statistics, 
it issues an OID_SIMBA_GET_STATS OID to the INIC 
driver. The INIC driver again fetches these statistics by 
issuing a UPR to the INIC and returns the statistics back to 
the ATCP driver. 

The INIC keeps track of received TCP segments and bytes
by simply looking at the protocol field of the IP header. It
does not, however, examine the destination IP address. It is
possible that one of the received TCP frames may need to be
forwarded back out another interface. In this case, the frame
never reaches the TCP layer, and thus, it should not be
reflected in the TCP statistics. We adjust for this in the INIC
driver when we discover that a frame is not associated with
any of the interfaces in our system.

There are a number of other statistics that we maintain in
the ADAPT structure explicitly for the purpose of debugging.
These include counters of just about any error
condition, or error frame encountered in the INIC driver. It
also includes various other counters, such as interrupt and
event counters, that we may use later to tune and optimize
the driver.

Two families of diagnostics are specified: hardware
diagnostics and firmware diagnostics. The hardware
diagnostics are split into several applications: engineering
hardware diagnostics, manufacturing diagnostics, and customer
diagnostics.

Each of the firmware and hardware diagnostic applications
requires a way to communicate directly with the INIC.
To do this we provide a set of standard device driver entry
points in our INIC NDIS driver. We accomplish this by
saving the NDIS entry points that are found in our
DriverObject structure, and replacing them with our own open,
close, write, read, and ioctl routines. When one of these
routines is called, we check the device extension of the
device object that is associated with the call. If the extension
is not associated with our diagnostic device, then we pass the
call off to the saved corresponding NDIS routine. Otherwise
we intercept and handle the call directly.

The firmware diagnostics provide a mechanism to exercise
and verify some level of INIC microcode functionality.
By putting the INIC into loopback mode, we can send and
receive slow-path frames. To ensure that we are exercising
as much of the final product as possible with these
diagnostics, we also use the majority of the INIC driver
transmit and receive code.

To send data passed down from the diagnostic application,
we allocate a chunk of memory used to contain the user's
data, and another chunk of memory, which we will use as a
command buffer. We copy the user's frame into our allocated
memory and initialize the command buffer. We then map the
command buffer and a statically allocated ethernet header
with NDIS buffer and packet descriptors and call the
SimbaTransmitPacket routine to send the data.

Note that in allocating our own command buffer and
pre-pending it with a separate ethernet header (containing a
type of 0x666), we are pretending to the SimbaTransmitPacket
routine to be the ATCP driver sending down a
command buffer (see the ATCP Command Buffer
description above).

The SFWDiagSend routine will return success to the user 
immediately, rather than waiting for the INIC to respond to 
the command buffer. This allows the diagnostic application 
to get many transmit commands operating in parallel. 

When we receive a command completion event from the
INIC, the SimbaHandleInterrupt routine calls the
SimbaXmtDpc routine. If the SimbaXmtDpc routine finds that
diagnostics are currently running, it will pass the completion
off to the SFWDiagSendComplete routine. The
SFWDiagSendComplete routine will simply free the resources
allocated by SFWDiagSend.

When we are running in diagnostic mode, the
SimbaRcvDpc routine calls the SFWDiagIndicateData routine
instead of NdisMIndicateReceivePacket when a packet has
arrived. The SFWDiagIndicateData routine places the
received packet on a queue and issues an event to wake up
any thread that might be waiting in SFWDiagRecv.

The SFWDiagRecv routine is called by the diagnostic
application to receive an expected frame. It waits for a
received frame to be indicated by the SFWDiagIndicateData
routine and then dequeues the frame from the diagnostic
receive queue. The data contained in the packet is copied out
to the user, and the packet is then returned by calling
SimbaReturnPacket.

Hardware diagnostics are used to verify the functionality 
of the INIC hardware. To do so requires that we run special 
diagnostic microcode on the INIC. When hardware diag- 
nostics are initiated, the INIC driver resets the INIC card and 
downloads the diagnostic microcode. After the user exits 
hardware diagnostics, the INIC is put back into operational 
mode by downloading the standard microcode and 
re-initializing the card and interfaces. 

Nearly every function entry and exit in the INIC driver
can be traced using the SIMBA_TRACE tracing facility.
Furthermore, every notable event, such as an error, is traced
as an important or critical trace event. The SIMBA_TRACE
facility keeps a circular log of trace events in system
memory. It can be disabled with a compile time option so
that in the production driver there is no performance impact.

The SIMBA_TRACE facility is set up so that a common
buffer is used to track events from both the ATCP and INIC
driver. This is achieved by passing the common buffer
address using the OID_SIMBA_ATK_GLOBALS set
OID.

For installation, the INIC driver searches for newly
installed cards by calling the GetPCIInformation utility with
the vendor and device ID of the INIC device. For each
four-port INIC, GetPCIInformation should return four
separate devices, each with a unique function number (0-3). For
each device returned by GetPCIInformation we must check
to see if it is already installed before proceeding with the
installation. Typically this would be as simple as calling the
IsNetCardAlreadyInstalled utility, but Microsoft apparently
thought that no one would ever write a multifunction
networking card, so they didn't put multifunction support in the
utility. We have therefore combined the functionality of the
IsNetCardAlreadyInstalled utility and support for
multifunction devices into our own version of the utility.

Having determined that we have not already installed the
device, we set the CardBase to the slot number of the card,
with the high order bits set to the bus number. This is
somewhat more confusing than setting it to the base network
number, as is done in the VPCI phase, but it is more
permanent in the event that an administrator starts installing
and de-installing adapters. We also save the bus number, slot
number and function number separately, along with the size
of the card and the index of the adapter within the card.

The bulk of the source code for the INIC driver is located 
in the driver/simba directory in the source tree. Other 
miscellaneous header files are scattered about in other 
directories as specified below. 

The following files are found in the Simba source direc- 
tory: 

simba.c — Contains the DriverEntry routine for the INIC 
driver, 

simba.h — The main header file for the INIC driver, this 
contains the SIMBA_DRIVER, SIMBA_CARD and 
ADAPT structures, as well as many other structures 
and definitions, 

simbamini.c — The miniport entry points and related
functions,

simbamisc.c — Miscellaneous support routines for the
INIC driver. Contains most of the initialization and
buffer management code,

endian.h — Endian swapping definitions used when parsing
network frame headers,

simbadownload.c — The microcode download for the
INIC. This is a two-dimensional statically defined
character array generated by the objtoc utility,

simbaproto.c — The protocol driver routines for an initial
(VINIC) phase of the driver, this is not included in an
intermediate (VPCI) or FINAL phase of the driver,

vpci.c — The entry points for VPCI requests,

[illegible] — [illegible] network requests,

ne2000.c — The DriverEntry and miniport routines for the
ne2000 portion of the INIC VPCI driver,

ne2000sw.h — The main software definitions for the
ne2000 portion of the INIC VPCI driver,

ne2000hw.h — The hardware definitions for the ne2000
NIC,

card.c — Low-level ne2000 network card routines,

interrupt.c — Interrupt, transmit and receive routines for
the ne2000 portion of the INIC VPCI driver,

diag.c — Hardware and firmware diagnostic routines,

diag.h — Definitions used by the diagnostic routines,

diagdownload.c — The diagnostic microcode download
for the INIC. Also a two dimensional array generated
by the objtoc utility,

oemsetup.* — installation scripts for the VINIC, VPCI,
and FINAL phase of the INIC driver,

precomp.h — Precompilation header file containing all of
the included header files,

sources.* — Compilation directives for the VINIC, VPCI
and FINAL phase of the driver,

update.bat — A quick and dirty batch file used to update
drivers on a test machine,

buildit.bat — A quick and dirty batch file used to build and
install the INIC driver,

config.bat — A quick and dirty batch file used to configure
the INIC driver for the VINIC, VPCI, or FINAL phase.

Other relevant files include:

driver/include/simbahw.h — Contains definitions about the
INIC hardware and firmware,

driver/include/[illegible] — [illegible] configuration
space definitions,

driver/include/simbamisc.h — Contains miscellaneous
software definitions shared between the INIC and
ATCP driver,

tools/diag/include/diagctl.h — Contains definitions shared
between diagnostic applications and the diagnostic
portion of the INIC driver,

tools/include/vpci.h — Contains definitions about VPCI
commands. Shared by other VPCI users such as the
AGDB utility.

The next several pages describe the design of the microcode
that executes on the microprocessors of the INIC. The
overall philosophy of the INIC is discussed above, while the
detailed configuration is described below, leaving this
section to discuss the INIC microcode in detail.

The following acronyms are defined for the INIC microcode:

ACK=Transport layer acknowledgement;

BC=Broadcast frame;

CCB=Communications Control Block; a block of control
information passed between the host and the INIC to
control a connection;

FSM=Finite state machine; a state/event matrix giving
action and next state;

ISR=Interrupt Status Register;

LRU=Least Recently Used; used in the SRAM CCB
buffer cache;

MC=Multicast frame;

MSS=Maximum segment size;

PST=Persist timer;

RTR=Retransmission timer.

As described below, the INIC has a set of 3
processors (CPUs) that provide considerable hardware-assist
to the microcode running thereon. The following
paragraphs list the main hardware-assist features.

The INIC has 32 hardware queues whose sizes are
user-definable; they can be used in any manner by the CPUs (and
the hardware) for passing 32-bit pointers/events around
without interlock issues when adding or removing entries
from the queues (e.g., DRAM free-buffer queues,
receive-frame queues, etc.).

The INIC also has a Receive hardware sequencer that
completely validates an input header as the frame is being
received by the MAC, validates TCP and IP checksums,
generates a frame status and a context lookup hash, moves
the frame into a DRAM buffer and queues the frame address
and status for processing by the Receive CPU into one of the
hardware queues mentioned above.

A set of Transmit sequencers work from the above-mentioned
queues to transmit frames. Like the Receive
sequencers, there is one Transmit sequencer per interface.

The INIC also has a custom 32-bit protocol processor that
is effectively 3 CPUs using shared hardware in a 3-level
pipelined architecture. The protocol processor provides
separate instruction and data paths to eliminate memory
contention.

Multiple register contexts or process slots are provided
with register access controlled by simply setting a process
register. The protocol processor provides 512 SRAM-based
registers to be shared among the 3 CPUs in any way desired.
The current implementation uses 16 processes of 16
registers each, leaving 256 scratch registers to be shared. This
includes a set of CPU-specific registers that are the same
local-cpu register number, but for which the real register is
determined by an offset based on the CPU number; this
allows multiple CPUs to execute the same code at the same
time without register clashes or interlocks. These registers
are a part of the above-mentioned scratch pool.

A specialized instruction set is provided to the CPUs to
assist network processing: endian-swap instructions, a hash
instruction to generate hash indexes, embedded interlocks
and instructions to set them, and a hardware-implemented
LRU mechanism.

Seven separate DMA engines are built into the INIC
hardware. The one to be used at any time is defined by the
source and destination, e.g., from SRAM to PCI, from
DRAM to SRAM; the DMA works off 32 descriptors in
SRAM, and at present, the code allocates one descriptor
permanently to each process. Completed DMAs are
determined by simply inspecting the Channel Events register.

The following design choices were made in the current
implementation. RECEIVE processing is run on one CPU,
TRANSMIT processing on another and the third CPU is
used as a UTILITY and DEBUG processor. Splitting receive
and transmit was chosen as opposed to letting 2 CPUs both
run receive and transmit. Initially one of the main reasons
for this was that the planned header-processing hardware
could not be shared and interlocks would be needed to do
this. However, the receive hardware CPU now runs
completely independently, and passes frames to the Receive
CPU via a hardware queue described above, rendering the
above issue moot. A good reason now for separating the
processor functions is that parts of the code depend on the
exclusive use of some shared resources by a particular CPU
and interlocks would be needed on them. It is expected that
the cost of all these interlocks would be fairly high, but
perhaps not prohibitive. Another reason is that the CPU
scratch registers have been carefully divided between the 3
CPUs. If multiple CPUs executed receive processing for
example, then they would be using each other's scratch
registers.

The INIC supports up to 256 TCP communication control
blocks (CCBs). A CCB is associated with an input frame
when the frame's source and destination IP addresses and
source and destination ports match that of the CCB. For
speed of access, there is a CCB index in hash order in
SRAM. The index can be searched based on the
hardware-generated hash to find a CCB entry that matches the frame.
Once a match has been found, the CCB is cached in SRAM.
There are up to 16 cached CCBs in SRAM. These cache
locations are shared between both CPUs so that the CPU
with the heavier load will be able to use more cache buffers.
There are 8 header buffers for receive and 8 command
buffers for transmit to be shared between the processes on
the CPUs. Note that each header and command buffer is not
statically linked to a specific CCB buffer. Instead the link is
dynamic on a per-frame or per-command basis. The use of
this dynamic allocation is explained in later sections.

Two basic processor implementations are possible: a
single-stack and a process model. The process model was
chosen here because the custom processor design is
providing near zero-cost overhead for process switching through
the use of a process base register, and because there will be
more than enough process slots available for the peak load.
It is also expected that all "local" variables will be held
permanently in registers whilst an event is being processed.
The features that provide this are:

256 of the 512 SRAM-based registers are used for the
register process slots. This is divided into 16 process slots of
16 registers each. Then 8 of these are reserved for receive
and 8 for transmit. A Little's Law analysis has shown that
supporting 512 byte frames at a maximum arrival rate of
4*100 Mbits requires more than 8 jobs to be in process in
the NIC. Each job requires an SRAM buffer for a CCB
context and at present, there are only 16 of these, 8 per CPU,
due to SRAM limitations. So more process slots (e.g., 32*8
regs each) do not seem worthwhile.

A process context switch simply involves reloading the
process base register based on the process to be restarted,
and jumping to the appropriate address for resumption. To
better support the process model chosen, the code locks an
active CCB into an SRAM buffer while either CPU is
operating on it. This implies there is no swapping to and
from DRAM of a CCB once it is in SRAM and an operation
is started on it. More specifically, the CCB will not be
swapped after requesting that a DMA be performed for it.
Instead, the system switches to another active process. Once
the DMA is complete, it will resume the former process at
the point directly after where the DMA was requested. This
constitutes a zero-cost switch as mentioned above.

[...] that will be queued to the Q_FREEL hardware queue. These
queues are also used to control small host buffers, large host
buffers, command buffers and command response buffers,
events from one CPU to the other, etc. Each CPU handles its
own timers independently of the others; there are 2 timer bits
in the [illegible], so Receive
and Transmit CPUs effectively each have their own timer
bit.

As described above, contexts (CCBs) are passed to the
INIC through the Transmit command and response buffers.
INIC-initiated CCB releases are handled through the
Receive small buffers. Host-initiated releases use the
Command buffers. There is strict handling of the acquisition and
release of CCBs to avoid windows where, for example, a
frame is received on a context just after the context was
passed to the INIC, but before the INIC has "accepted" it, as
described in detail above.

The initial implementation of the INIC may not handle
T/TCP connections, since they are typically used for the
HTTP protocol and the client for that protocol typically
connects, sends a request and disconnects in one segment.
The server sends the connect confirm, reply and disconnect
in its first segment. Then the client confirms the disconnect.
This is a total of 3 segments for the life of a context. Typical
data lengths are on the order of 300 bytes from the client and
3K from the server. The INIC will provide as good an assist
[illegible] forwarded [illegible] such as when a filled-in form
is sent by the client. Note however that the INIC will support
HTTP over a normal TCP connection in fast-path mode. Also
note that later implementations handle T/TCP, SPX and UDP.

Table 2 summarizes SRAM requirements for the Receive,
Transmit and Utility CPUs:

TABLE 2

As described above, the host determines when a TCP
connection is able to be handed to the INIC, sets up the CCB
and passes it to the card via a command in the Transmit queue.
CCBs that the INIC owns can be handed back to the host via
a request from the Receive or Transmit CPUs or from the
host itself at any time.

When the INIC receives a frame, one of its immediate
tasks is to determine if the frame is for a CCB that it
controls. If not, the frame is passed to the host on what is
termed the slow path. On transmit, the transmit request will
specify a CCB number if the request is on an INIC-controlled
CCB. Thus the initial state for the INIC will be
transparent mode in which all received frames are directly
passed through and all transmit requests will be simply
thrown on the appropriate wire. This state is maintained until
the host passes CCBs to the INIC to control. Note that

Receive and transmit processing on an individual CCB frames received for whictTihe INIC ha s no CCB (or it is with-/ 

are each controlled by separate state machines; the state the host) wjflstin.have-theTCPTnd'rP checksums.verified 

machines are run from within a process. 65 if TCP/TR^Sumlady^the-Jro^ the INIC 

The initial INIC has 16 MB of DRAM. Utility initialize- calctdate^and inserUh^checteurns on-a.fra 

tion microcode divides a large portion of this into 2 K buffers wKichtthe3NIC_has_no CCB. 



Hardware use (DRAM fifos etc) 




5120 


CCB buffers 


256 bytes * 16 


4096 


CCB headers 


16 bytes " 256 


4096 


Header buffers 


128 bytes * 8 


1024 


Command buffers 


128 bytes * 8 


1024 


Debugger/Stats etc 


1024 






16K bytes 
There are 512 registers available in the INIC. The first 256
are used for process slots. The remaining 256 are split
between the 3 CPUs. Table 3 lists the register usage.

TABLE 3

Register Usage

0-255:      16 processes, 16 registers each;
256-287:    32 for RCV general processing;
288-319:    32 for XMT general processing;
320-367:    48 for UTL (CPU 3);
368-383:    16 for RCV/XMT processing;
384-415:    32 CPU-specific for RCV;
416-447:    32 CPU-specific for XMT;
448-479:    32 CPU-specific for UTL;
448-511:    64 for UTL scratch.
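
As a rough illustration of how the process-slot part of Table 3 could be addressed, the following C sketch derives the value a process base register might hold for each slot. All identifiers are hypothetical; the source only states that there are 16 slots of 16 registers, with the first 8 slots used by Receive and the next 8 by Transmit.

```c
#include <assert.h>

/* Layout constants taken from Table 3; the names are illustrative only. */
enum {
    SLOT_REGS      = 16,  /* registers per process slot            */
    NUM_SLOTS      = 16,  /* slots 0-7: Receive, 8-15: Transmit    */
    XMT_SLOT_FIRST = 8
};

/* Value a process base register would hold for a given slot. */
static int slot_base(int slot)
{
    assert(slot >= 0 && slot < NUM_SLOTS);
    return slot * SLOT_REGS;
}

/* Absolute register index of "local" register r within a slot. */
static int slot_reg(int slot, int r)
{
    assert(r >= 0 && r < SLOT_REGS);
    return slot_base(slot) + r;
}

/* Nonzero if the slot belongs to the Transmit CPU. */
static int slot_is_transmit(int slot)
{
    assert(slot >= 0 && slot < NUM_SLOTS);
    return slot >= XMT_SLOT_FIRST;
}
```

Under this assumption a context switch amounts to loading `slot_base(n)` for the process being resumed; no register contents need to be saved or restored.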



The following sources of events exist in the INIC:

1) A Receive input queue: Hardware will automatically
DMA arriving frames into frame buffers and queue an
event into the Q_RECV hardware queue.

2) A General Event register for Timer events: Expiration
of the 2 millisecond timer will set 2 bits in this register,
one for each processor.

3) Transmit request queues: There is one queue allocated
per interface for requests from the host processor.
These requests come via the Utility CPU, which initially
DMAs the request into a small DRAM buffer and
then queues a pointer to that buffer to the appropriate
Transmit request queue.

4) Receive and Transmit CCB events queues: these
queues are used to pass events to be processed against
a CCB state machine. The events may initiate in any of
the CPUs.

5) Receive and Transmit System queues: these queues are
used for system events i.e. those not directed at a
specific CCB.

6) The Channel Events register: this register contains the
DMA-completed bits for all 32 DMA descriptors; there
will be one descriptor allocated for each of the 16
processes, so that when the relevant bit is set in the
Channel Events register, the DMA that that process
fired off is complete.

As mentioned earlier, there are 16 process slots in which
to execute. The first 8 are allocated to the Receive CPU, the
next 8 to the Transmit CPU.

The microcode is split into 6 sections based on functionality.
These sections are:

The Mainloop;

Receive frame processing;

Receive event processing for CCB events;

Receive command processing;

Transmit command processing;

Transmit event processing for CCB events.

Within each of these divisions exist subdivisions. For
example, receive-frame processing has code for non-CCB
frames ("slow-path"), and for CCB frames ("fast-path").
These sections will be discussed in the following few pages.

Receive and Transmit share the same Main Loop code.
This is made possible because of the CPU-specific registers
defined by the hardware e.g., 384-415, 416-447. Also the
functions that the mainloops need to perform are identical.
The major functions are:

to check if any DMAs have completed,

to determine if any processes are now restartable,

to see if a timer tick has occurred,

to scan all the hardware queues for new events for this CPU.

The following is a C-like summary of the main loop:

forever {
    while (dma_events & OUR_CPU_MASK) {
        clear dma_event bit;
        restart waiting process;
    }
    while any processes are runnable {
        run them by jumping to the start/resume address;
    }
    if (timer_tick) {
        reset timer_tick bit;
        jump to this_cpu_timer_rtne;
    }
    if (available process entries) {
        while (q_out_rdy & OUR_QUEUES_MASK) {
            call appropriate event handler to service the event;
            this will setup a new process to be run (get free process entry,
            header buffer, CCB buffer, set the process up).
        }
    }
}
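
The summary above can be modeled in compilable C as a single pass over simulated state. This is only a sketch under stated assumptions: the structure `cpu_state`, its fields and the function `mainloop_pass` are hypothetical, hardware registers are replaced by plain integers, and a "process" here simply runs to completion and frees its slot, which is simpler than the real microcode.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLOTS 8            /* process slots owned by one CPU */

struct cpu_state {
    uint32_t dma_events;       /* one DMA-complete bit per slot       */
    uint32_t runnable;         /* slots ready to start or resume      */
    uint32_t busy;             /* slots currently allocated           */
    int      timer_tick;       /* set by the 2 ms timer               */
    int      queued_events;    /* events waiting in hardware queues   */
    int      runs, ticks;      /* counters, for observation only      */
};

/* One pass of the main loop; returns the number of events dispatched. */
static int mainloop_pass(struct cpu_state *s)
{
    int dispatched = 0;
    assert(s != 0);

    /* 1. Completed DMAs make their waiting process restartable. */
    s->runnable |= s->dma_events & s->busy;
    s->dma_events = 0;

    /* 2. Run every runnable process; here it finishes and frees its slot. */
    for (int i = 0; i < NUM_SLOTS; i++) {
        if (s->runnable & (1u << i)) {
            s->runnable &= ~(1u << i);
            s->busy     &= ~(1u << i);
            s->runs++;
        }
    }

    /* 3. Service the timer tick. */
    if (s->timer_tick) {
        s->timer_tick = 0;
        s->ticks++;
    }

    /* 4. While a slot is free, turn queued events into new processes. */
    for (int i = 0; i < NUM_SLOTS && s->queued_events > 0; i++) {
        if (!(s->busy & (1u << i))) {
            s->busy     |= 1u << i;
            s->runnable |= 1u << i;
            s->queued_events--;
            dispatched++;
        }
    }
    return dispatched;
}
```

Calling `mainloop_pass` repeatedly on the same `cpu_state` plays the role of the `forever` loop; events that find no free slot simply stay queued until a later pass.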
Receive-frame processing can be broken down into the
following stages:

First, Receive Dequeue and Header Validation, which
includes considerable hardware assist. Much header validation
is implemented in hardware in conjunction with MAC
processing by scanning the data as it flies by. The Receive
hardware sequencer performs a number of tests to generate
status from the various headers.

For the MAC header the Receive hardware sequencer
determines if Ethernet/802.3, if MC/BC, if it matches our
MAC address A or B, determines the network protocol, and
flags if not a MAC status of "good packet."

For the Network header the Receive hardware sequencer
determines if header checksum is valid, header length is
valid (e.g. IP >=5), network length > header length, what the
transport protocol is, if there is any fragmentation or network
options, and whether the destination network address
is ours.

For the Transport header the Receive hardware sequencer
determines if the checksum is valid (incl. pseudo-header if
relevant), header length is valid (e.g. TCP >=5), length is
valid, what is the session layer protocol (e.g. SMB, HTTP or
FTP data), are there any transport flags set (e.g. FIN/SYN/
URG/RST bits), and any options present.

As frames are received by the INIC from a network, they
are placed into 2K-byte DRAM buffers by the Receive
hardware sequencer, along with 16 bytes of the above frame
status. A pointer to the last byte+1 of this buffer is queued
into the Q_RECV queue. The pointer contains a bit (bit 29)
that informs the microcode if this frame is definitely not a
fast-path candidate (e.g., not TCP/IP, or has an error of some
sort). Receive frame processing involves extracting this
pointer from the Receive hardware queue, and setting up a
DMA into an SRAM header buffer of the first X bytes from
the DRAM frame buffer. The size of the DMA is determined
by whether bit 29 is set or not. If it is set (this frame is not
a fast-path candidate), then only the status bytes are needed
by the microcode, so the size would be 16 bytes. Otherwise
up to 92 bytes are DMA'd, sufficient to get all useful
headers. When this DMA is complete, the status bytes are
used by the microcode to determine whether to jump to
fast-path or slow-path processing.
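
The choice of DMA size described above might be expressed as follows. This is a sketch only: the macro and function names are hypothetical, and it assumes that the 92-byte limit counts the 16 status bytes together with the headers, which the text does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

#define NOT_FASTPATH_BIT (1u << 29) /* bit 29 of the queued pointer       */
#define STATUS_BYTES     16u        /* frame status written by hardware   */
#define MAX_HEADER_DMA   92u        /* enough for all useful headers      */

/* Number of bytes to DMA from the DRAM frame buffer into the SRAM
 * header buffer. `avail` is the number of bytes present in the DRAM
 * buffer (status plus frame data). */
static uint32_t header_dma_size(uint32_t q_recv_entry, uint32_t avail)
{
    assert(avail >= STATUS_BYTES);
    if (q_recv_entry & NOT_FASTPATH_BIT)
        return STATUS_BYTES;        /* slow path: status bytes only */
    return avail < MAX_HEADER_DMA ? avail : MAX_HEADER_DMA;
}
```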

If bit 29 is set, this frame is going slow-path. Effectively 
this means that the frame will not be processed against an 
on-INIC CCB. It will be passed directly to the host, although 
if the frame is TCP/IP, then its checksums have already been
validated by the hardware. Also, all other header validations 
have been performed. 

If bit 29 is not set, then there may be an onboard CCB for 
this frame. The Receive sequencer has already generated a 
hash based on the network and transport addresses, e.g., IP 
source and destination addresses and TCP ports. This hash is 
used to index directly into a hash table on the INIC that 
points to entries in a CCB header table. The header table 
entries are chained on the hash table entry. The microcode 
uses the hash to determine if a CCB exists on the INIC for 
this frame. It does this by following this chain from the hash 
table entry, and for each chained header table entry, com- 
paring its source and destination addresses and ports with 
those of the frame. If a match is found, then the frame will 
be processed against the CCB by the INIC. If not, then the 
frame is sent for slow-path processing. 
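
A minimal C sketch of the chained lookup just described is given below. The table size, structure layout and function names are assumptions made for illustration; the hash itself is produced by the Receive sequencer in the source, so here it is simply passed in as a parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_BUCKETS 256              /* assumed table size */

struct ccb_header {                   /* one entry of the CCB header table */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    int      ccb_index;               /* which CCB this entry describes    */
    struct ccb_header *next;          /* chain on the same hash bucket     */
};

static struct ccb_header *hash_table[HASH_BUCKETS];

/* Chain a header table entry onto its hash table entry. */
static void ccb_insert(struct ccb_header *h, uint8_t hash)
{
    assert(h != NULL);
    h->next = hash_table[hash];
    hash_table[hash] = h;
}

/* Returns the CCB index for a fast-path match, or -1 for slow path. */
static int ccb_lookup(uint8_t hash, uint32_t src_ip, uint32_t dst_ip,
                      uint16_t src_port, uint16_t dst_port)
{
    const struct ccb_header *h;
    for (h = hash_table[hash]; h != NULL; h = h->next) {
        if (h->src_ip == src_ip && h->dst_ip == dst_ip &&
            h->src_port == src_port && h->dst_port == dst_port)
            return h->ccb_index;
    }
    return -1;
}
```

The full address and port comparison matters because two different connections can produce the same hash; only an exact match selects fast-path processing.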

In the first product, the fast-path has been implemented as 
a finite state machine (FSM) that covers 3 layers of the 
protocol stack, i.e., IP, TCP and Session. The state transitions 
for the Receive FSM and the events that cause them are 
discussed below. 

The following summarizes the steps involved in normal 
fast-path frame processing: 

1) Get control of the associated CCB; this involves 
locking the CCB to stop other processing (e.g. 
Transmit) from altering it while this receive processing 
is taking place. 

2) Get the CCB into an SRAM CCB buffer; there are 16 
of these buffers in SRAM and they are not flushed to 
DRAM until the buffer space is needed by other CCBs. 
Acquisition and flushing of these CCB buffers is con- 
trolled by a hardware LRU mechanism. Thus getting 
the CCB into a buffer may involve flushing another 
CCB from its SRAM buffer. 

3) Examine the frame header to generate an event from it. 
The Receive events that can be generated on a given 
context from a frame are: 

receive a valid and complete Session layer packet; 
receive a valid and incomplete Session layer packet; 
receive a pure ACK; 

receive an "invalid" frame, i.e., one that causes the
CCB to be flushed to the host;
receive a window probe; 
receive a partial/split NetBios header. 

4) Process the event against the CCB's FSM using that
frame. 

Each event and state intersection provides an action to be 
executed and a new state. The following is an example of a 
state/event transition, the action to be executed and the new 
state: Assume the state is IDLE (SR_NI), and the event is
VALID INCOMPLETE RECEIVE FROM THE RCV 
QUEUE (ER_VRIR). The action from this state/event inter- 
section is AR_RPHH and the next state is WAIT MDL, 
CCB Q EXISTS (SR_WMTQ). To summarize, the first of 
an incomplete Session layer packet has been received. For 
example, if the Session layer is NetBIOS, then this frame 
contains the NetBIOS header, but it does not contain all the 
Session layer data. The action performs the following steps: 

1) DMA a small amount of the payload (192 bytes) into 
a small host header buffer; 

2) Process the amount sent to the host through TCP — it 
has been delivered; 

3) Queue the frame to the internal CCB frame queue in the 
CCB SRAM buffer; 

4) DMA appropriate receive status into the header buffer, 
including setting the COMPLETE bit; 
5) Post ISR status to the Utility CPU via the Q_EVENT2
queue, so that it will generate a host interrupt with it; 

6) Generate an event to the Transmit CPU via the 
Q_EVENT1 queue to check if output is now possible; 
and 

7) Exit from Receive FSM processing. 
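
One common way to code such a state/event intersection is a lookup that returns an action and a next state. The C sketch below fills in only the single transition named in the text (SR_NI with ER_VRIR gives AR_RPHH and SR_WMTQ); the placeholder `AR_NONE`, the type names and the default of staying in the same state are assumptions, not part of the described design.

```c
#include <assert.h>

enum rx_state  { SR_NI, SR_WMTQ, RX_NUM_STATES }; /* IDLE; WAIT MDL, CCB Q EXISTS */
enum rx_event  { ER_VRIR, RX_NUM_EVENTS };        /* valid incomplete receive     */
enum rx_action { AR_NONE, AR_RPHH };              /* AR_NONE is a placeholder     */

struct transition {
    enum rx_action action;   /* action to execute          */
    enum rx_state  next;     /* state after the action     */
};

/* Look up the action and next state for a state/event intersection.
 * Every cell not described in the text stays in the same state. */
static struct transition rx_fsm(enum rx_state s, enum rx_event e)
{
    struct transition t;
    assert(s < RX_NUM_STATES && e < RX_NUM_EVENTS);
    t.action = AR_NONE;
    t.next   = s;
    if (s == SR_NI && e == ER_VRIR) {
        t.action = AR_RPHH;
        t.next   = SR_WMTQ;
    }
    return t;
}
```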
The following steps summarize slow-path Receive pro- 
cessing by the INIC: 

1) Examine frame status bytes to determine if frame is 
in-error, if so, only these status bytes will be sent to the 
host; 

2) Move the frame into either a small or a large host buffer 
via DMA. It is not split across these buffers; 

3) Set frame status and address details and DMA status to 
the host; 

4) Send event to the Utility processor to post Receive 
status in the ISR. 

Once the INIC is handling CCBs, i.e. fast-path 
processing, there are numerous other events that need to be 
processed apart from received frames for that CCB. The 
following are the relevant events: 
lock a new context (from Xmit); 
unlock a new context (from Xmit); 
receive frame (complete or incomplete) from the CCB 
queue; 

receive window update from the CCB queue; 
receive a partial/split NetB header from the CCB queue; 
end of the CCB queue of frames; 
flush context request from host; 
flush context request from Xmit;
context release/flush complete from Xmit. 
The following summarizes Receive Event processing: 

1) Get control of the associated CCB; this involves 
locking the CCB to stop other processing (e.g. 
Transmit) from altering it while this processing is 
taking place; 

2) Get the CCB into an SRAM CCB buffer; 

3) If the event is "Check CCB queue", check the internal 
queue in the CCB; if there are frames queued, dequeue 
the next one, get its header into an SRAM header buffer 
and examine it to generate a specific event; if no frames 
are queued, exit; 

4) Either way, process the event against the CCB's FSM. 
FIG. 19 provides a state diagram summary of the Receive
FSM states and the main events and transitions. Processing
Receive Commands by the INIC occurs when the host posts 
a receive MDL to the INIC by filling in a command buffer 
with appropriate data pointers, etc., and posting it to the 
INIC via the Receive Command Buffer Address register.
Note that there is only one host receive MDL register. The 
INIC Utility CPU will DMA the command in and place a 
pointer to it in the Q_RCMD queue which the Receive CPU 
will work on. 

There are two possible commands sent to the INIC from 
the host and both apply only to fast-path processing. The 
commands are: 
Receive MDL for remaining session-layer data; 
Abort/flush a context. 

The following summarizes Receive Command process- 
ing: 

1) Get an SRAM command buffer and get the first 32 
bytes of the command into it; 

2) Determine the CCB involved and get control of it; this 
involves locking the CCB to stop other processing (e.g. 
Transmit) from altering it while this processing is 
taking place; 
3) Get the CCB into an SRAM CCB buffer;

4) Generate an event based on the command type;

5) Process the event against the CCB's FSM.

As mentioned above, the fast-path has been implemented
as a finite state machine (FSM) that covers at least 3 layers
of the protocol stack, i.e., IP, TCP and Session. There are
actually separate FSMs for Receive and Transmit. The state
involved is the state of the CCB connection (Receive or
Transmit) and encompasses those 3 layers. Events are generated
from the sources of events detailed earlier, and they
are applied against the FSM giving an action to execute and
a new state.

Several Receive details should be noted. First, regarding
window updates from the host, the host application has to
tell the INIC when that application has accepted the received
data that has been queued. This is so that the INIC can
update the TCP receive window. This is achieved by piggybacking
these on transmit or receive MDLs on the same
CCB. Second, for an INIC-controlled CCB, the INIC does
not maintain a keepalive timer. This leaves the host with the
job of determining that the CCB is still active. Third, a
Timestamp option is supported in the fast path because it
leads to better round-trip estimations (RTT) for TCP. However
this is optional depending upon SRAM size limitations
on the on-INIC CCB. Fourth, the INIC maintains an Idle
timer for CCBs under its control.

Transmit Command Processing begins when the host
posts a transmit request to the INIC by filling in a command
buffer with appropriate data pointers, etc., and posting it to
the INIC via the Command Buffer Address registers. Note
that there is one of these registers per interface. The INIC
Utility CPU will DMA the command in and place it in the
appropriate Q_XMIT queue which the Transmit CPU will
work on. There is also one of these queues per interface so
that transmit processing can round-robin service these
queues to keep all interfaces busy, and not let a highly-active
interface lock out the others (which would happen with a
single queue).

There are 4 possible commands sent to the INIC from the
host. The commands are:

1) Null command: essentially just a window update;
fast-path only;

2) New context pending; fast-path only;

3) New context confirm; fast-path only;

4) Transmit command; fast- and slow-path.

The following summarizes Transmit Command processing:

1) Get an SRAM command buffer and get the first 32
bytes of the command into it;

2) Determine if there is a CCB involved and if so, get
control of it; this involves locking the CCB to stop
other processing (e.g. Transmit) from altering it while
this processing is taking place;

3) If a CCB is involved, get the CCB into an SRAM CCB
buffer and generate an event based on the command
type; then process that event against the CCB's FSM;

4) Otherwise perform slow-path transmit command processing.

For Transmit Slow-Path Processing, the queued request
will already have been provided by the host stack with the
appropriate MAC and TCP/IP (or whatever) headers in the
frame to be output. Also the request is guaranteed not to be
greater than MSS-sized in length. So the processing is fairly
simple. A large buffer is acquired and the frame is moved by
DMA into it, at which time the checksum is also calculated.
If the frame is TCP/IP, the checksum will be appropriately
adjusted if necessary (pseudo-header etc) and placed in the
TCP header. The frame is then queued to the appropriate
MAC transmit interface. Then the command is immediately
responded to with appropriate status through the Host
Response queue.

The following summarizes the steps performed:

1) DMA the remainder of the command if larger than 32
bytes, into the SRAM command buffer. This implies
that a slow-path command cannot be larger than the
size of the SRAM command buffer (128 bytes);

2) Examine command to determine if output TCP/IP
checksumming is required;

3) When output checksumming is required:

The host sets the length of the MAC through TCP headers
into the command for the INIC. This is so that the Receive
CPU may DMA the header into an SRAM buffer to calculate
and set the IP/TCP checksums. Use half of the command
buffer as a header buffer for this purpose. This avoids using
an SRAM CCB buffer that would cause an unnecessary flush
to DRAM of a CCB buffer. Doing this may result in unused
command fields being moved down over those fields that
have already been loaded into CPU registers, so as to gain
space in the SRAM buffer. Even with this trick, there is a
maximum header size that the host can send for a frame for
which checksumming is requested (82 bytes).

DMA the header from host memory to the header buffer,
DMA the remainder of the frame from the host to the
appropriate offset in a large DRAM buffer, leaving room for
the frame headers. Note that the command is derived from
an MDL on the host and may contain scatter/gather lists that
need to be processed. This latter DMA will provide the TCP
checksum of the payload. Manually calculate and insert the
IP checksum in the SRAM header buffer. Then calculate the
checksum of the TCP header and pseudo-header in the
SRAM header buffer and add in the payload checksum.
Insert the TCP checksum into the frame header. Then DMA
the entire header to the front of the DRAM buffer and queue
the buffer to the appropriate Xmit hardware queue based on
the requested interface in the command. Post new ISR status
to the Utility processor to be passed to the host.

4) When no checksumming is required:

DMA the entire frame from host memory into a large
DRAM buffer and queue the buffer to the appropriate Xmit
hardware queue based on the requested interface in the
command. Note that the command is derived from an MDL
on the host and may contain scatter/gather lists that need to
be processed. Post new ISR status to the Utility processor to
be passed to the host.

The following is an overview of the Transmit fast-path
flow once a command has been posted. The transmit request
may be a segment that is less than the MSS, or it may be as
much as a full 64K Session layer packet. The former request
will go out as one segment, the latter as a number of
MSS-sized segments. The transmitting CCB must hold on to
the request until all data in it has been transmitted and acked.
Appropriate pointers to do this are kept in the CCB. To
create an output TCP/IP segment, a large DRAM buffer is
acquired from the Q_FREEL queue. Then data is DMAd
from host memory into the DRAM buffer to create an
MSS-sized segment. This DMA also checksums the data. The
TCP/IP header is created in SRAM and DMAd to the front
of the payload data. It is quicker and simpler to keep a basic
frame header permanently in the CCB and DMA this
directly from the SRAM CCB buffer into the DRAM buffer
each time. Thus the payload checksum is adjusted for the
pseudo-header and placed into the TCP header prior to
DMAing the header from SRAM. Then the DRAM buffer is
queued to the appropriate Q_UXMT transmit queue. The
final step is to update various window fields etc in the CCB.
Eventually either the entire request will have been sent and
acked, or a retransmission timer will expire in which case
the context is flushed to the host. In either case, the INIC will
place a command response in the Response queue containing
the command buffer handle from the original transmit
command and appropriate status.
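
The way a payload checksum produced during DMA can be combined with a separately computed header sum follows from the one's-complement arithmetic of the Internet checksum. The C sketch below illustrates that property in software; the function names are hypothetical, and it assumes the header portion has an even length so that 16-bit word alignment is preserved when the two sums are added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add `len` bytes to a running one's-complement sum (big-endian words). */
static uint32_t csum_add(uint32_t sum, const uint8_t *p, size_t len)
{
    assert(p != NULL || len == 0);
    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t)p[0] << 8;   /* pad a trailing odd byte with zero */
    return sum;
}

/* Fold the carries and complement to get the 16-bit checksum field. */
static uint16_t csum_finish(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

In this sketch the payload sum would be the value delivered by the DMA, and the header and pseudo-header bytes are then added with `csum_add` before `csum_finish` produces the value for the TCP checksum field.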

The above discussion has dealt with how an actual
transmit occurs. However the real challenge in the transmit
processor is to determine whether it is appropriate to transmit
at the time a transmit request arrives, and then to
continue to transmit for as long as the transport protocol
permits. There are many reasons not to transmit: the receiver's
window size is <=0, the Persist timer has expired, the
amount to send is less than a full segment and an ACK is
expected/outstanding, the receiver's window is not half-open
etc. Much of transmit processing will be in determining
these conditions.

The fast-path has been implemented as a finite state 
machine (FSM) that covers at least 3 layers of the protocol 
stack, i.e., IP, TCP and Session. The state transitions for the 
Transmit FSM and the events that cause them are discussed 
below.

The following summarizes the steps involved in normal 
fast-path transmit command processing: 

1) Get control of the associated CCB (gotten from the
command); this involves locking the CCB to stop other
processing (e.g. Receive) from altering it while this
transmit processing is taking place;

2) Get the CCB into an SRAM CCB buffer; there are 16
of these buffers in SRAM and they are not flushed to
DRAM until the buffer space is needed by other CCBs.
Acquisition and flushing of these CCB buffers is controlled
by a hardware LRU mechanism. Thus getting the
CCB into a buffer may involve flushing another CCB from
its SRAM buffer;

3) Process the SEND COMMAND (EX_SCMD) event 
against the CCB's FSM. 

Each event and state intersection provides an action to be 
executed and a new state. The following is an example of the 
state/event transition, the action to be executed and the new 
state for the SEND command while in transmit state IDLE 
(SX_IDLE): The action from this state/event intersection is 
AX_NUCMD and the next state is XMIT COMMAND 
ACTIVE (SX_XMIT). To summarize, a command to trans- 
mit data has been received while Transmit is currently idle. 
The action performs the following steps: 

1) Store details of the command into the CCB; 

2) Check that it is OK to transmit now e.g. send window 
is not zero; 

3) If output is not possible, send the Check Output event
to Q_EVENT1 queue for the Transmit CCB's FSM
and exit;

4) Get a DRAM 2K-byte buffer from the Q_FREEL
queue into which to move the payload data;

5) DMA payload data from the addresses in the scatter/
gather lists in the command into an offset in the DRAM
buffer that leaves space for the frame header; these
DMAs will provide the checksum of the payload data;

6) Concurrently with the above DMA, fill out variable
details in the frame header template in the CCB; also
get the IP and TCP header checksums while doing this;
note that base IP and TCP header checksums are kept
in the CCB, and these are simply updated for fields that
vary per frame, viz. IP Id, IP length, IP checksum, TCP
sequence and ack numbers, TCP window size, TCP
flags and TCP checksum.

7) When the payload DMA is complete, DMA the frame 
header from the CCB to the front of the DRAM buffer; 

8) Queue the DRAM buffer to the appropriate Q_UXMT 
queue for the interface for this CCB; 

9) Determine if there is more payload in the command; if 
so, save the current command transfer address details in 
the CCB and send a CHECK OUTPUT event via the 
Q_EVENT1 queue to the Transmit CCB; if not, send
the ALL COMMAND DATA SENT (EX_ACDS) 
event to the Transmit CCB; 

10) Exit from Transmit FSM processing. 
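
Step 6 above relies on updating a stored base checksum for the few header fields that change per frame. One standard way to do this is the incremental update of RFC 1624, HC' = ~(~HC + ~m + m'), sketched below in C. The source does not say which formula the microcode uses, so this is an assumed technique, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Full Internet checksum over n 16-bit words, used here as a reference. */
static uint16_t csum_words(const uint16_t *w, int n)
{
    uint32_t sum = 0;
    assert(w != 0 && n > 0);
    for (int i = 0; i < n; i++)
        sum += w[i];
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Update a checksum when one 16-bit field changes from old_val to
 * new_val, following RFC 1624: HC' = ~(~HC + ~m + m'). */
static uint16_t csum_update16(uint16_t csum, uint16_t old_val, uint16_t new_val)
{
    uint32_t sum = (uint16_t)~csum;
    sum += (uint16_t)~old_val;
    sum += new_val;
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

With a base checksum kept alongside the header template, a change to a field such as the IP Id costs a few additions rather than a pass over the whole header.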
Once the INIC is handling CCBs, i.e. fast-path
processing, there are numerous other events that need to be
processed apart from transmit commands from the host for 
that CCB. The following are the relevant events: 

1) New context pending (from the new context pending 
command); 

2) New context confirm (from that command also); 

3) Flush context request from Receive; 

4) Send data (after Check Output determines this can be 
done); 

5) Send an ACK (from Receive); 

6) All command data sent; 

7) Received ACK for all outstanding on a command; 

8) Persist conditions detected (WIN=0, no RTR, no PST); 

9) Context flush event detected (e.g. RTR expired); 

10) Send a window update; 

11) Persist timer expired; 

12) Check for RTR expiry; 

13) Check for PST expiry; 

14) Maybe send an ACK; 

15) Maybe update the IDLE timer; 

16) Context termination sync event received. 
The following summarizes Transmit Event processing: 

1) Get control of the associated CCB; this involves 
locking the CCB to stop other processing (e.g. Receive)
from altering it while this processing is taking place.

2) Get the CCB into an SRAM CCB buffer; 

3) If the event is "Check Output", check whether it is now 
possible to output on this CCB; if so, process the Send 
Data (EX_SD) event; if not, check for other conditions 
e.g. all of a command's data has been ACKed (EX_ 
RACK), a window update is needed (EX_SWU), 
output is available but it is not possible to send (EX 
WE0); 

4) If there is any event, process it against the CCB's FSM. 
FIG. 20 provides a state diagram summary of the Transmit
FSM states and the main events and transitions. The state
involved is the state of the transmit path of the CCB 
connection. Events are generated from the sources of events 
detailed above, and they are applied against the FSM giving 
an action to execute and a new state. The following diagram 
provides a summary of the Transmit FSM states and the 
main events and transitions. 

Several Transmit details should be noted. First, the
slow-start algorithm that is now a part of the TCP
standard will be handled on the INIC. It seems unreasonable
to wait until the connection is sending at full-rate before
passing it to the INIC.

Also, the congestion algorithm will not be handled on the
card. To reach congested state, the connection will have
dropped frames, so it will have flushed. The host will NOT
hand out a CCB in congestion state; it will wait until it has
gotten out of that state.

A Window Probe is sent from the sending CCB to the
receiving CCB, and it means the sender has the receiver in
PERSIST state. Persist state is entered when the receiver
advertises a zero window. It is thus the state of the transmitting
CCB. In this state, he sends periodic window probes
to the receiver in case an ACK from the receiver has been
lost. The receiver will return his latest window size in the
ACK.

A Window Update is sent from the receiving CCB to the
sending CCB, usually to tell him that the receiving window
has altered. It is mostly triggered by the upper layer when it
accepts some data. This probably means the sending CCB is
viewing the receiving CCB as being in PERSIST state.

Persist state: it is planned to handle Persist timers on the INIC.
However as soon as the Persist timer completely expires, the
CCB will be flushed. This means that a zero window has
been advertised for a few seconds. A zero window would
normally be a transient situation, and would tend to happen
mostly with clients that do not support slow-start. However
it should normally reopen before the timer expires.

The INIC code expects all transmit requests for which it
has no CCB to not be greater than the MSS. If any request
is, it will be dropped and an appropriate response status
posted.

As a receiver, the INIC will do the right thing regarding
Silly Window avoidance and not advertise small windows;
this is easy. However it is necessary to also do things to
avoid this as a sender, for the cases where a stupid client
does advertise small windows. Without getting into too
much detail here, the mechanism requires the INIC code to
calculate the largest window advertisement ever advertised
by the other end. It is an attempt to guess the size of the other
end's receive buffer and assumes the other end never reduces
the size of its receive buffer. See Stevens, Vol. 1 pp.
325-326.

The third processor (P2) of the integrated processors is
termed the Utility Processor. P2 performs the reset function,
manages the interface to the system, and performs the debug
function. The following pages will describe these functions
in the format found in the code. The first major function is
reset. Second is the system interface, which is composed of

formatted, the data necessary to configure PCI can be found
in this device. If EEPROM does not exist, but FLASH is
available and properly formatted, data to configure PCI is
obtained from the FLASH memory. If neither of these
options is available, PCI configuration space is set up using
ROM defaults. In this case bit 0 of the external options
indicates that the debug processor should be initialized. Both
EEPROM and FLASH read routines are contained in ROM
as they are required for PCI configuration. The FLASH read
routine is fairly straightforward. The EEPROM routines use
the bit level interface of the EEPROM. Refer to the
EEPROM specs to find a description of the operation of this
interface.

Once PCI has been configured the INIC is ready to talk to the
system. At this point minimal functionality is available. The
mini idle loop provides only two functions, a branch to
check status, and a branch to a small command decode
function. The mini idle loop shares the check status routine
with the main idle loop, and uses a very small portion of its
function. The check status routine will be described within
the main idle loop description. The command decode function
supports FLASH reads, setting the interrupt status
pointer, setting the status, setting the mask, and writing
control store.

Control store writes are done in three consecutive instructions.
The first transfer is the address to be written. This
transfer also includes two control bits, one to indicate that
this is a compare rather than a write, and one to indicate that
at the completion of this operation we should jump to the
start address in writeable control store. The second transfer
is the low half of the control store instruction, and the third
transfer is the high half.

At the completion of the load of control store P2 branches
to the newly downloaded code. Once this occurs, DRAM is
initialized, and then its size is computed. This is done by first
determining its configuration. By setting the addressing
structure to maximum and writing to address 1c00, the
memory configuration can be computed. If this write aliases
to 0c00, address bit 12 is missing. If the write also aliases to
0400, bit 11 is missing. Once this has been determined the
proper addressing structure can be initialized. Once the
proper addressing configuration has been set, the size of
DRAM can be determined using the same alias technique to
determine missing high order address bits,

the idle loop and associated routines. Last is the debug 45 The final major reset function that is performed is queue 

function - initialization. Each queue uses 128 bytes of SRAM, and a 

Two reset functions have been implemented, a hard or configurable amount of DRAM, from a minimum of IK 

cold reset, and a soft or warm reset. Power up or the bytes to a maximum of 128K. First the queues are initialized 

occurrence of a system reset causes hard reset. Soft reset to the DRAM size defined by control store constants. Each 
occurs as a result of the system writing 'dead' to location 0 so queue begins its use of DRAM on the 128K boundary 

of INIC memory. P2 distinguishes between these two resets following the beginning of the previous queue, so after the 

by the condition of the write pending bit in the PCI address queues are initialized, a mechanism for recovering the free 

register. If this bit is on, a soft reset has occurred, and PCI space between queues that have not been initialized to 

configuration space will not be set up. maximum size is initiated. 

One of the functions of P2 in the reset process is to load 55 Two queues are allocated for use as an aid to managing 

the writeable control store (WCS) with code provided by the local DRAM. One queue contains addresses of 256 byte 

system. In order to bypass this sophisticated mechanism to blocks, and one contains addresses of 2K blocks. The 2K 

enable the load of code for in-circuit test, a synchronous queue size is determined by DRAM size, rather than a 

mechanism using all three processors has been designed. If control store constant. After all queues have been initialized 
bit 1 of the external options register has been set, all three 60 the process of allocating DRAM not used by the queues is 

processors will perform this finction. begun. First blocks at the end of the first queue are added to 

Only those functions necessary to be able to load WCS the 256 byte queue until a 2K boundary is found at which 

from the host are implemented in ROM. The remaining point 2K blocks are added to the 2K queue 'until the 

functions implemented in ROM are subroutines that can be beginning of the next queue is reached. This process is 
easily rewritten in WCS if errors are discovered later. First 65 repeated until the DRAM located between the last and next 

of the ROM functions is the initialization of PCI configu- to last queue has been recovered. At this point the 2K queue 

ration space. If the EEPROM exists and is properly is filled with the remaining DRAM until the bottom address 
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of the CCB storage area is reached. At this point, entries are times prior to halting the processor under test When a break 

removed from the 2K queue in order to be split down and instruction is issued to P2, it checks to see if this is already 

added to the 256-byte queue until it is close to full. In order a break point for the other processor, and if so simply turns 

to avoid partial tail full issues, this queue is not completely on the bit to indicate both processors are using this break 

Ued - s point- If the address is not presently a break point, P2 finds 

At the conclusion of queue initialization P0 and PI are the next available storage location in SRAM for this break 

started, and P2 enters the idle loop. The idle loop is the code point information. It then stores the address of the break 

executed by P2 when it is wailing for the next task. The point, and the contents of the instruction at that address in 

purpose of the loop is to check all of the conditions that the SRAM storage location. It then stores a jump instruction 
could initiate an operation. At the highest level, these are: 10 to the breakpoint execution code at the location of the break 

P0 or PI hit a debug processor installed breakpoint; point. 

A system request has occurred over PCI; Each break point has a unique location that it jumps to in 

A DMA channel has changed state- ' order 10 9 uickl y determine the address of the location where 

A network interface has changed state; ^ replaced ^trucuon can be found. It also allows the 
. . ... 15 processor to determine if this break is for it or the other 

A process has requested status be sent to the system; potential processor under test. The break point jump 

A transmitter or receiver has stored statistics. instruction, in addition to jumping to the break point code 

These functions are checked in this order. If service is saves the hardware status of the system . When the processor 

required at any check, it is provided and the loop is begun takes this jump, it saves the remaining state required to allow 
at the beginning. Thus if the system becomes very busy, the 20 it to operate independently, and then determines if the break 

most likely thing to find itself being serviced less is the was intended for itself. If it was not, it builds the instruction 

statistics function. Service of processor baits due to break that was replaced, restores the state, executes the instruction 

points will be discussed in conjunction with the description and returns to the normal code. If however it determines that 

of the debug processor. Service of system requests can be the break instruction was for it, it sets a flag for P2 and halts 
broken into two major subsets. The first is system requests 25 When P2 discovers in the idle loop that a halted processor 

from the functional device driver, and second is system has set a flag, it steps the processor through the same code 

requests from the debug driver. described above that it would have otherwise executed in 

The Functional Command Decode performs the requests real time. It then leaves the processor under test stopped in 

described in the INIC Utility Processor description. Most the instruction after the break instruction, and sends status to 
requests are interface dependent A request is made for a 30 the system that the processor has encountered a break point 

specific interface to perform a specific function. As an Resetting a break point causes the instruction to be restored 

example, Pa addresses 10, 14, 18 and 1C are used to store to its original condition in control store and the storage 

the pointer to the system memory location where interrupt location in SRAM to be cleared. 

status should be stored for network interfaces 0, 1, 2, and 3 P2 can also perform a DMA channel State Change For 
respectively. A limited number of commands are not inter- 35 the four transmit command buffer and the receive buffer 

face dependent, and are generally intended to be used on functions, P2 will DMA the command buffer into local 

interface 0. These are queue a receive buffer, write control memory, modify the pointer for use by the transmit or 

store, read eeprom, and the flash read and write commands. receive processors, and add the pointer to the proper queue. 

Most of these commands simply cause a value to be This task is split into three separate functions in order to 
stored, after which P2 returns to the idle loop. If a DMA 40 keep this function operating concurrently with all other 

operation is requested, at the end of the operation, status operations. 

indicating the successful or unsuccessful completion of the The first part of the process is the actual command 

request will be sent to the system. Those that initiate a DMA decode. A single queue (Q_HOSTIF) is used to store 

and generate a later status presentation are read statistics, requests for the four separate transmit functions and the 

read PHY status, write configuration, and read configuration. 45 receive function. At command decode time two entries are 

In addition, the four transmit command buffer requests, stored on Q_HOSTIF: the address of the queue lhat will 

along with the receive command buffer request cause a ultimately be the destination of the buffer, and the pointer to 

DMA to be performed, but no status is required by the the location in system memory where the buffer resides 

system after the completion of these DMA operations. The The second part of this operation occurs when the idle 

function of these operations will be covered under the idle 50 loop detects that Q_HOSTlF is not empty A non-empty 

loop DMA service discussion. condition indicates a request to initiate the DMA of the 

As with the functional processor, the INIC Debug Inter- buffer to INIC local DRAM. When this occurs P2 first 

face description covers the basic function of this code. The determines if a DMA channel is available. Channels 23-26 

halt, run, step, dump and load commands are all fairly are used for this purpose. If a channel is available, a buffer 

straightforward and are documented in the above referenced 55 is obtained from the free queue and a DMA operation is 

spec. Although break is functionally described, further initiated to this buffer. The final destination queue address 

explanation of the operation of this code is contained in this and the address of the end of the buffer are stored in an 

document. The functions of the debug processor that are SRAM location linked to this DMA channel, and P2 returns 

covered in the Utility Processor document do not require to the idle loop. 

status presentation. All of the commands, triggered by 60 The final part of this operation occurs when it is deter- 
stonng a pointer in the command location, do require ending mined in the idle loop lhat the DMA operation has corn- 
status to be presented. pleted. The SRAM location linked to this channel contains 
The break function requires twelvebytes of storage for the queue address and the data pointer to be queued. P2 
each break point that is stored. Each break point can cause obtains this data and queues the pointer, completing the 
either one or both processors to halt, or can simply trigger 65 operation. 

an indication that the instruction at that location has been In addition to the SRAM locations used to store a descrip- 

executed. Each break point can be executed a specified n tion of the active DMA operation, four bits are used in the 
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dma_stalus register. These bits are used to indicate that 
there has been a DMA operation initiated on their respective 
channel. During part two of the above process the bit is used 
to determine channel availability, and is set once a channel 
is acquired. During part three the bit is reset. 
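
The three-part command buffer DMA flow can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class and method names, the bit-to-channel layout of the status word, and the use of a plain buffer address in place of the stored end-of-buffer address are all hypothetical. Only the Q_HOSTIF queue, channels 23-26 and the set/reset behavior of the four status bits come from the text.

```python
from collections import deque

DMA_CHANNELS = (23, 24, 25, 26)   # channel numbers given in the text


class HostBufferDma:
    """Toy model of the three-part command-buffer DMA flow (names assumed)."""

    def __init__(self, free_buffers):
        self.q_hostif = deque()          # pending (dest_queue, host_ptr) requests
        self.free = deque(free_buffers)  # addresses of free local buffers
        self.dma_status = 0              # one bit per channel (layout assumed)
        self.pending = {}                # per-channel record, standing in for SRAM
        self.queues = {}                 # destination queues by name

    def decode(self, dest_queue, host_ptr):
        """Part 1: command decode stores the request on Q_HOSTIF."""
        self.q_hostif.append((dest_queue, host_ptr))

    def start_dma(self):
        """Part 2: claim an idle channel and a free buffer, if both exist."""
        if not self.q_hostif or not self.free:
            return None
        for bit, channel in enumerate(DMA_CHANNELS):
            if not self.dma_status & (1 << bit):
                dest_queue, _host_ptr = self.q_hostif.popleft()
                self.dma_status |= 1 << bit          # bit set on acquisition
                self.pending[channel] = (dest_queue, self.free.popleft())
                return channel
        return None                                  # all four channels busy

    def complete(self, channel):
        """Part 3: queue the buffer pointer and reset the channel bit."""
        dest_queue, buffer_addr = self.pending.pop(channel)
        self.queues.setdefault(dest_queue, deque()).append(buffer_addr)
        self.dma_status &= ~(1 << DMA_CHANNELS.index(channel))
```

Keeping the per-channel record separate from the status bits mirrors the split the text describes: the bits answer "is this channel busy?" cheaply in the idle loop, while the record carries what is needed to finish the operation.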

For tracking network interface changes, a register (link_stat)
is maintained with the current status of each of the
network interfaces. When one or more of the interfaces 
changes status (as defined by this register) status is set up to 
notify the system of this change. 
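
A minimal Python sketch of this change detection follows. The one-bit-per-interface layout and the function name are assumptions made for illustration; the text only states that link_stat holds the current status of each network interface.

```python
def changed_interfaces(link_stat, current, interfaces=4):
    """Return the indices of interfaces whose status bit differs.

    link_stat is the previously recorded status word and current is the
    freshly sampled one; both use one bit per interface (assumed layout).
    """
    diff = link_stat ^ current
    return [i for i in range(interfaces) if (diff >> i) & 1]
```

A non-empty result would be the condition under which status is set up for the system.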

The function of the check status routine is to check to see
if any functions have requested status to be sent to the
system, and to send it if such a request has been made. The
first step in this process is to reset any DMA channels that
have completed a transfer. Once this has been accomplished,
P2 checks to see if there are any requests to send new status.
If there are not, P2 returns to the idle loop. If there are
requests outstanding, P2 checks to make sure that there is
not another request being serviced for that interface, or that
a previously sent status has not yet been reset by the system.
If there is a request for service outstanding and neither of
these conditions exists, an operation to send status to the
system is initiated.

The first step in this operation is to insure that if there are
multiple requests ready to be serviced they are served in a
round robin fashion. Once an interface has been selected
using this method P2 determines if interrupts are being
aggregated. If they are, the time is checked, and if we are
still within the aggregation window P2 returns to the idle
loop. If the timer has expired, P2 first checks that the host
has not sent back the status register with the status bits we
want to set already set. Although this is an unlikely
occurrence, if the host prefers to not see certain status from
the INIC, this is a possible mechanism for insuring that
outcome. If this does occur, P2 returns to the idle loop.

If this is indeed new status that has not been returned to
INIC, P2 sends this status to the system. At the conclusion
of this operation P2 checks to see if interrupts are masked,
and returns to the idle loop if they are. If they are not, an
interrupt is generated and then P2 returns to the idle loop.

The flag register serves to interlock the status areas with
the system. When status is sent to the system, a bit in the flag
register corresponding to the DMA channel used is set. This
bit is not reset until after the system writes status back to us.
Each functional sub-processor, utility and network 0-3, has
its own status area and flag register interlock. The status
areas are located sequentially in SRAM, and the bits in the
flag register, for convenience, correspond to the DMA
channel that is allocated to the sub-processor. The utility
processor uses channel 31, and the four network processors
use channels 30-27. Because there are only four available
interrupts, network processors 0 and 1 share interrupt A.

For maintaining statistics, when either a transmit or
receive processor completes a transfer, it posts completion
status information in the Q_STATS queue. P2 recovers
these entries, analyzes them, and updates the local statistics
counters. This function is performed only when no more
pressing requests for P2 are outstanding.

An outbound dma transfer generated by the INIC (a
system read) can not pass a system pci write through either
a host/pci or pci/pci bridge. We must, therefore, insure that
we disconnect on all outbound dma transfers so that if the
system tries to write to us we won't deadlock. All outbound
dma operations are short, containing control data. When one
of these operations occurs, the idle loop is shortened to
check only pci writes and the completion of the dma of
control data. However, because a pathological worst case
could have the system write to all five devices (the four
network processors and the debug processor), context is
stored for the return path and the operation outstanding. The
flags register contains five bits, one for each process, indi-
cating that this disconnected dma operation is in progress,
and five registers contain the return addresses for each of the
processes.

The remainder of this document will describe the INIC
hardware specification. This features an INIC peripheral
component interconnect (PCI) interface which supports both
5.0V and 3.3V signaling environments, both 32-bit and 64-bit
PCI interface, and PCI clock frequencies from 15 MHz
to 66 MHz. Other features of this interface include a high
performance bus mastering architecture, host memory based
communications that reduce register accesses, host memory
based interrupt status word which reduces register reads,
Plug and Play compatibility, PCI specification revision 2.1
compliance, PCI bursts of up to 512 bytes, support of cache
line operations up to 128 bytes, support of both big-endian and
little-endian byte alignments, and support of Expansion ROM.

The INIC Network Interface includes four internal 802.3
and ethernet compliant Macs, a Media Independent Interface
(MII) connectable to external PHYs and supporting
10BASE-T, 100BASE-TX/FX and 100BASE-T4 in full and
half-duplex modes. Automatic PHY status polling notifies
the system of status changes. SNMP statistics counters are
provided, broadcast and multicast packets are handled with
a promiscuous mode provided for network monitoring or
multiple unicast address detection. The interface supports
huge packets of 32KB, Mac-layer loop-back test mode, and
auto-negotiating Phys.

The INIC memory features include external Dram buff-
ering of transmit and receive packets, buffering configurable
as 4MB, 8MB, 16MB or 32MB, with a 32-bit interface that
supports throughput of 224MB/S. External FLASH ROM up
to 4MB is provided, for diskless boot applications, as well
as external serial EEPROM for custom configuration and
Mac addresses.

The INIC Protocol Processor includes a high speed,
custom, 32-bit processor executing 66 million instructions
per second, and processing various protocols with focus on
IP, TCP and NETBIOS. Up to 256 resident TCP/IP contexts
can be cached on the INIC for fast-path processing. A
writeable control store (WCS) allows field updates for
feature enhancements.

The INIC Power includes a 3.3V chip operation and PCI
controlled 5.0V/3.3V I/O cell operation. Initial packaging
includes 272-pin plastic ball grid array, with 91 PCI signals,
68 MII signals, 58 external memory signals, 1 clock signal
and 54 signals split between power and ground.

The microprocessor is a 32-bit, full-duplex, four channel,
10/100-Megabit per second (Mbps), Intelligent Network
Interface Controller, designed to provide high-speed proto-
col processing for server applications. It combines the
functions of a standard network interface controller and a
protocol processor within a single chip. Although designed
specifically for server applications, the microprocessor can
be used by PCs, workstations and routers or anywhere that
TCP/IP protocols are being utilized.

When combined with four 802.3/MII compliant Phys and
Synchronous DRAM (SDram), the microprocessor provides
four complete ethernet nodes. It contains four 802.3/ethernet
compliant Macs, a PCI Bus Interface Unit (BIU), a memory
controller, transmit fifos, receive fifos and a custom TCP/
IP/NETBIOS protocol processor. The microprocessor sup-
ports 10Base-T, 100Base-TX, 100Base-FX and 100Base-T4
via the MII interface attachment of appropriate Phys.

The microprocessor Macs provide statistical information
that may be used for SNMP. The Macs operate in promis-
cuous mode allowing the microprocessor to function as a
network monitor, receive broadcast and multicast packets
and implement multiple Mac addresses for each node.

Any 802.3/MII compliant PHY can be utilized, allowing
the microprocessor to support 10BASE-T, 10BASE-T2, 
100BASE-TX, 100Base-FX and 100BASE-T4 as well as 
future interface standards. PHY identification and initializa- 
tion is accomplished through host driver initialization rou- 
tines. PHY status registers can be polled continuously by the 
microprocessor and detected PHY status changes reported to 
the host driver. The Mac can be configured to support a 
maximum frame size of 1518 bytes or 32768 bytes. 

The 64-bit, multiplexed BIU provides a direct interface to 
the PCI bus for both slave and master functions. The 
microprocessor is capable of operating in either a 64-bit or 
32-bit PCI environment, while supporting 64-bit addressing 
in either configuration. PCI bus frequencies up to 66 MHz 
are supported yielding instantaneous bus transfer rates of 
533MB/S. Both 5.0V and 3.3V signaling environments can 
be utilized by the microprocessor. Configurable cache-line 
size up to 256B will accommodate future architectures, and 
Expansion ROM/Flash support will allow for diskless sys- 
tem booting. Non-PC applications are supported via pro- 
grammable big and little endian modes. Host based com- 
munication has been utilized to provide the best system 
performance possible. 

The microprocessor supports Plug-N-Play auto- 
configuration through the PCI configuration space. External 
pull-up and pull-down resistors, on the memory I/O pins, 
allow selection of various features during chip reset. Support 
of an external eeprom allows for local storage of configu- 
ration information such as Mac addresses. 

External SDram provides frame buffering, which is con- 
figurable as 4MB, 8MB, 16MB or 32MB using the appro- 
priate SIMMs. Use of -10 speed grades yields an external 
buffer bandwidth of 224MB/S. The buffer provides tempo- 
rary storage of both incoming and outgoing frames. The 
protocol processor accesses the frames within the buffer in 
order to implement TCP/IP and NETBIOS. Incoming frames 
are processed, assembled then transferred to host memory 
under the control of the protocol processor. For transmit, 
data is moved from host memory to buffers where various 
headers are created before being transmitted out via the Mac. 

FIG. 21 provides an overview of the INIC hardware.

The following Cores/Cells form the INIC: LSI Logic
Ethernet-110 Core, 100Base & 10Base Mac with MII
interface, LSI Logic single port Sram, triple port Sram and
ROM available, LSI Logic PCI 66 MHz, 5V compatible I/O
cell and LSI Logic PLL.

Table 4 outlines the INIC Die Size using an LSI Logic
G10 process.

TABLE 4

MODULE        DESCR                SPEED          AREA
Scratch RAM   1Kx128 sport         4.37 ns nom.   06.77 mm²
WCS           8Kx49 sport          6.40 ns nom.   18.29 mm²
MAP           128x7 sport          3.50 ns nom.   00.24 mm²
ROM           1Kx49 32col          5.00 ns nom.   00.45 mm²
REGs          512x32 tport         6.10 ns nom.   03.49 mm²
Macs          .75 mm² x 4 =                       03.30 mm²
PLL           .5 mm² =                            00.55 mm²
Misc. Logic   117,260 gates at
              5035 gates/mm² =                    23.29 mm²
TOTAL CORE

Table 5 outlines the INIC Pin Count, from table 4 above.

TABLE 5

(Core side)²  =                                   56.22 mm²
Core side     =                                   07.50 mm
Die side      = core side + 1.0 mm (I/O cells) =  08.50 mm
Die area      = 8.5 mm x 8.5 mm =                 72.25 mm²
Pads needed   = 220 signals x 1.25 (vss, vdd) =   275 pins
LSI PBGA      =                                   272 pins

Table 6 outlines the INIC Datapath Bandwidth.

TABLE 6

(12 MB/s/100 Base) x 2 (full duplex) x 4 connections = 100 MB/s
Average frame size = 512 B
Frame rate = 100 MB/s/512 B = 195,312 frames/s
Cpu overhead/frame = (256 B context read) + (64 B header read) +
    (128 B context write) + (128 B misc.) = 512 B/frame
Total bandwidth = (512 B in) + (512 B out) + (512 B Cpu) =
    1536 B/frame
Dram Bandwidth required = 1536 B/frame x 195,312 frames/s =
    300 MB/s
Dram Bandwidth @ 60 MHz = (32 bytes/167 ns) = 202 MB/s
Dram Bandwidth @ 66 MHz = (32 bytes/150 ns) = 224 MB/s
PCI Bandwidth required = 100 MB/s
PCI Bandwidth available @ 30 MHz, 32 b, average = 46 MB/s
PCI Bandwidth available @ 33 MHz, 32 b, average = 50 MB/s
PCI Bandwidth available @ 60 MHz, 32 b, average = 92 MB/s
PCI Bandwidth available @ 66 MHz, 32 b, average = 100 MB/s
PCI Bandwidth available @ 30 MHz, 64 b, average = 92 MB/s
PCI Bandwidth available @ 33 MHz, 64 b, average = 100 MB/s
PCI Bandwidth available @ 60 MHz, 64 b, average = 184 MB/s
PCI Bandwidth available @ 66 MHz, 64 b, average = 200 MB/s

Table 7 outlines the INIC Cpu Bandwidth.

TABLE 7

Receive frame interval = 512 B/40 MB/s = 10.24 us
Instructions/frame @ 60 MHz = (10.24 us/frame)/(50 ns/instruction) =
    205
Instructions/frame @ 66 MHz = (10.24 us/frame)/(45 ns/instruction) =
    228
Required instructions/frame = 250
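
Several of the derived figures in Tables 6 and 7 can be reproduced with a few lines of Python. This is a hedged sketch: the constant and function names are hypothetical, the 100 MB/s aggregate rate, the 512 B per-frame CPU overhead and the 10.24 us frame interval are taken as stated in the tables rather than re-derived, and the frame rate is truncated to a whole number as the table does.

```python
FRAME_BYTES = 512                      # average frame size (Table 6)
AGGREGATE_BYTES_PER_S = 100_000_000    # stated rate for four full-duplex links


def frames_per_second():
    """Frame rate, truncated to whole frames as in Table 6."""
    return AGGREGATE_BYTES_PER_S // FRAME_BYTES


def dram_bytes_per_second(cpu_bytes_per_frame=512):
    """Required DRAM bandwidth: frame in + frame out + CPU overhead."""
    per_frame = FRAME_BYTES + FRAME_BYTES + cpu_bytes_per_frame
    return per_frame * frames_per_second()


def instructions_per_frame(frame_interval_ns, instruction_ns):
    """Instructions available per frame, rounded to the nearest integer."""
    return round(frame_interval_ns / instruction_ns)


print(frames_per_second())                 # frames/s
print(dram_bytes_per_second())             # bytes/s, roughly 300 MB/s
print(instructions_per_frame(10240, 50))   # 60 MHz case
print(instructions_per_frame(10240, 45))   # 66 MHz case
```

Expressing the frame interval in integer nanoseconds (10.24 us = 10240 ns) avoids small floating-point surprises before rounding.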



The following hardware features enhance INIC perfor- 
mance: 512 registers afford reduced scratch ram accesses 
and reduced instructions, register windowing eliminates 
context-switching overhead, separate instruction and data 
paths eliminate memory contention, resident control store 
eliminates stalling during instruction fetch, multiple logical 
processors eliminate context switching and improve real- 
time response, pipelined architecture increases operating 
frequency, shared register and scratch ram improve inter- 
processor communication, fly-by state-machine assists 
address compare and checksum calculation, TCP/IP-context 
caching reduces latency, hardware implemented queues 
reduce CPU overhead and latency, horizontal microcode 
greatly improves instruction efficiency, automatic frame 
DMA and status between MAC and DRAM buffer, deter- 
ministic architecture coupled with context switching elimi- 
nates processor stalls. 






The INIC processor is a convenient means to provide a 
programmable state-machine which is capable of processing 
incoming frames, processing host commands, directing net- 
work traffic and directing PCI bus traffic. Three processors 
are implemented using shared hardware in a three-level 
pipelined architecture which launches and completes a 
single instruction for every clock cycle. The instructions are 
executed in three distinct phases corresponding to each of 
the pipeline stages where each phase is responsible for a 
different function. 

The first instruction phase writes the instruction results of 
the last instruction to the destination operand, modifies the 
program counter (Pc), selects the address source for the 
instruction to fetch, then fetches the instruction from the 
control store. The fetched instruction is then stored in the 
instruction register at the end of the clock cycle. 

The processor instructions reside in the on-chip control
store, which is implemented as a mixture of ROM and
SRAM. The ROM contains 1K instructions starting at
address 0x0000 and aliases each 0x0400 locations through-
out the first 0x8000 of instruction space. The Sram (WCS)
will hold up to 0x2000 instructions starting at address
0x8000 and aliasing each 0x2000 locations throughout the
last 0x8000 of instruction space. The ROM and Sram are
both 49-bits wide accounting for bits [48:0] of the instruc-
tion microword. A separate mapping ram provides bits
[55:49] of the microword (MapAddr) to allow replacement
of faulty ROM based instructions. The mapping ram has a
configuration of 128x7 which is insufficient to allow a
separate map address for each of the 1K ROM locations. To
allow re-mapping of the entire 1K ROM space, the map ram
address lines are connected to the address bits Fetch[9:3].
The result is that the ROM is re-mapped in blocks of 8
contiguous locations.
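
The address decoding just described can be sketched in Python. The function name, the 16-bit width of the fetch address, and the choice to report a map index only for ROM fetches are assumptions for illustration; the aliasing intervals and the use of Fetch[9:3] as the mapping ram address come from the text.

```python
ROM_WORDS = 0x0400    # 1K ROM instructions, aliased every 0x0400
WCS_WORDS = 0x2000    # writeable control store, aliased every 0x2000


def decode_fetch(addr):
    """Map a fetch address to (store, offset, map_index).

    map_index is Fetch[9:3], the 7-bit address into the 128x7 mapping
    ram, so 8 contiguous ROM locations share one map entry.
    """
    addr &= 0xFFFF                       # 16-bit instruction space (assumed)
    if addr < 0x8000:
        return ("ROM", addr % ROM_WORDS, (addr >> 3) & 0x7F)
    return ("WCS", addr % WCS_WORDS, None)
```

For example, ROM offsets 0x010 through 0x017 all yield map index 2, which is why a faulty instruction is replaced together with its seven neighbors.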

The second instruction phase decodes the instruction 
which was stored in the instruction register. It is at this point 
that the map address is checked for a non-zero value which 
will cause the decoder to force a Jmp instruction to the map 
address. If a non-zero value is detected then the decoder 
selects the source operands for the Alu operation based on
the values of the OpdASel, OpdBSel and AluOp fields. 
These operands are then stored in the decode register at the 
end of the clock cycle. Operands may originate from File, 
Sram, or flip-flop based registers. The second instruction 
phase is also where the results of the previous instruction are 
written to the Sram. 

The third instruction phase is when the actual Alu opera- 
tion is performed, the test condition is selected and the Stack 
push and pop are implemented. Results of the Alu operation 
are stored in the results register at the end of the clock cycle. 

FIG. 22 shows an overview of the pipelined micropro-
cessor 470, in which instructions for the receive, transmit
and utility processors are executed in three alternating
phases according to Clock increments I, II and III, the phases
corresponding to each of the pipeline stages. Each phase is 
responsible for different functions, and each of the three 
processors occupies a different phase during each Clock 
increment. Each processor usually operates upon a different 
instruction stream from the control store 480, and each 
carries its own program counter and status through each of 
the phases. 

In general, a first instruction phase 500 of the pipelined 
microprocessors completes an instruction and stores the 
result in a destination operand, fetches the next instruction, 
and stores that next instruction in an instruction register. A
first register set 490 provides a number of registers including
the instruction register, and a set of controls 492 for the first
register set provides the controls for storage to the first
register set 490. Some items pass through the first phase
without modification by the controls 492, and instead are
simply copied into the first register set 490 or a RAM file
register 533. A second instruction phase 560 has an instruc-
tion decoder and operand multiplexer 498 that generally
decodes the instruction that was stored in the instruction 
register of the first register set 490 and gathers any operands 
which have been generated, which are then stored in a 
decode register of a second register set 496. The first register 
set 490, second register set 496 and a third register set 501, 
which is employed in a third instruction phase 600, include 
many of the same registers, as will be seen in the more 
detailed views of FIGS. 15A-C. The instruction decoder and 
operand multiplexer 498 can read from two address and data 
ports of the RAM file register 533, which operates in both 
the first phase 500 and second phase 560. A third phase 600 
of the processor 470 has an arithmetic logic unit (ALU) 602 
which generally performs any ALU operations on the oper- 
ands from the second register set, storing the results in a 
results register included in the third register set 501. A stack 
exchange 608 can reorder register stacks, and a queue 
manager 503 can arrange queues for the processor 470, the 
results of which are stored in the third register set. 

The instructions continue with the first phase then fol- 
lowing the third phase, as depicted by a circular pipeline 
505. Note that various functions have been distributed 
across the three phases of the instruction execution in order 
to minimize the combinatorial delays within any given 
phase. With a frequency in this embodiment of 66 MHz,
each clock increment takes 15 nanoseconds to complete, for
a total of 45 nanoseconds to complete one instruction for 
each of the three processors. The rotating instruction phases 
are depicted in more detail in FIGS. 15A-C, in which each 
phase is shown in a different figure.
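
The timing figures quoted above follow directly from the clock frequency. The short sketch below reproduces the arithmetic; the function names are illustrative only, and the exact period of a 66 MHz clock (about 15.15 ns) is rounded to the 15 ns stated in the text.

```python
PHASES = 3  # three rotating instruction phases, per the text


def clock_period_ns(freq_mhz: float) -> float:
    """Return the period of one clock increment in nanoseconds."""
    return 1000.0 / freq_mhz


def instruction_latency_ns(freq_mhz: float, phases: int = PHASES) -> float:
    """Return the time for one instruction to pass through every phase."""
    return phases * clock_period_ns(freq_mhz)


if __name__ == "__main__":
    period = clock_period_ns(66.0)
    latency = instruction_latency_ns(66.0)
    print(f"{period:.2f} ns per clock, {latency:.2f} ns per instruction")
```

Because the pipeline is circular, each of the three processors completes one instruction in that interval while occupying a different phase on every clock.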

More particularly, FIG. 23A shows some specific hard- 
ware functions of the first phase 500, which generally 
includes the first register set 490 and related controls 492. 
The controls 492 for the first register set include an SRAM
control 502, which is a logical control for loading address 
and write data into SRAM address and data registers 520. 
Thus the output of the ALU 602 from the third phase 600 
may be placed by SRAM control 502 into an address register 
or data register of SRAM address and data registers 520. A 
load control 504 similarly provides controls for writing a 
context for a file to file context register 522, and another load 
control 506 provides controls for storing a variety of mis- 
cellaneous data to flip-flop registers 525. ALU condition 
codes, such as whether a carried bit is set, get clocked into 
ALU condition codes register 528 without an operation 
performed in the first phase 500. Flag decodes 508 can 
perform various functions, such as setting locks, that get 
stored in flag registers 530. 

The RAM file register 533 has a single write port for 
addresses and data and two read ports for addresses and data, 
so that more than one register can be read from at one time. 
As noted above, the RAM file register 533 essentially 
straddles the first and second phases, as it is written in the 
first phase 500 and read from in the second phase 560. A
control store instruction 510 allows the reprogramming of 
the processors due to new data in from the control store 480, 
not shown in this figure, the instructions stored in an 
instruction register 535. The address for this is generated in 
a fetch control register 511, which determines which address 
to fetch, the address stored in fetch address register 538. 
Load control 515 provides instructions for a program 
counter 540, which operates much like the fetch address for 
the control store. A last-in first-out stack 544 of three
registers is copied to the first register set without undergoing 
other operations in this phase. Finally, a load control 517 for 
a debug address 548 is optionally included, which allows 
correction of errors that may occur.

FIG. 23B depicts the second microprocessor phase 560,
which includes reading addresses and data out of the RAM
file register 533. A scratch SRAM 565 is written from
SRAM address and data register 520 of the first register set,
which includes a register that passes through the first two
phases to be incremented in the third. The scratch SRAM
565 is read by the instruction decoder and operand
multiplexer 498, as are most of the registers from the first
register set, with the exception of the stack 544, debug
address 548 and SRAM address and data register mentioned
above. The instruction decoder and operand multiplexer 498
looks at the various registers of set 490 and SRAM 565,
decodes the instructions and gathers the operands for
operation in the next phase, in particular determining the
operands to provide to the ALU 602 below. The outcome of
the instruction decoder and operand multiplexer 498 is stored
to a number of registers in the second register set 496,
including ALU operands 579 and 582, ALU condition code
register 580, and a queue channel and command 587 register,
which in this embodiment can control thirty-two queues.
Several of the registers in set 496 are loaded fairly directly
from the instruction register 535 above without substantial
decoding by the decoder 498, including a program control
590, a literal field 589, a test select 584 and a flag select
585. Other registers such as the file context 522 of the first
phase 500 are always stored in a file context 577 of the
second phase 560, but may also be treated as an operand that
is gathered by the multiplexer 572. The stack registers 544
are simply copied in stack register 594. The program counter
540 is incremented 568 in this phase and stored in register
592. Also incremented 570 is the optional debug address 548,
and a load control 575 may be fed from the pipeline 505 at
this point in order to allow error control in each phase, the
result stored in debug address 598.

FIG. 23C depicts the third microprocessor phase 600,
which includes ALU and queue operations. The ALU 602
includes an adder, priority encoders and other standard logic
functions. Results of the ALU are stored in registers ALU
output 618, ALU condition codes 620 and destination
operand results 622. A file context register 616, flag select
register 626 and literal field register 630 are simply copied
from the previous phase 560. A test multiplexer 604 is
provided to determine whether a conditional jump results in
a jump, with the results stored in a test results register 624.
The test multiplexer 604 may instead be performed in the
first phase 500 along with similar decisions such as fetch
control 511. A stack exchange 608 shifts a stack up or down
by fetching a program counter from stack 594 or putting a
program counter onto that stack, results of which are stored
in program control 634, program counter 638 and stack 640
registers. The SRAM address may optionally be incremented
in this phase 600. Another load control 610 for another
debug address 642 may be forced from the pipeline 505 at
this point in order to allow error control in this phase also.
A QRAM & QALU 606, shown together in this figure, read
from the queue channel and command register 587, store in
SRAM and rearrange queues, adding or removing data and
pointers as needed to manage the queues of data, sending
results to the test multiplexer 604 and a queue flags and
queue address register 628. Thus the QRAM & QALU 606
assume the duties of managing queues for the three
processors, a task conventionally performed sequentially by
software on a CPU, the queue manager 606 instead providing
accelerated and substantially parallel hardware queuing.

The micro-instructions are divided into six types according
to the program control directive. The micro-instruction
is further divided into sub-fields for which the definitions are
dependent upon the instruction type. The word formats for
the six instruction types are listed in Table 8 below.

TABLE 8

TYPE  [55:49]    [48:47]  [46:42]  [41:33]      [32:24]      [23:16]  [15:00]
Jcc   0b0000000  0b00     AluOp    OpdASel      OpdBSel      TstSel   Literal
Jmp   0b0000000  0b01     AluOp    OpdASel      OpdBSel      FlgSel   Literal
Jsr   0b0000000  0b10     AluOp    OpdASel      OpdBSel      FlgSel   Literal
Rts   0b0000000  0b11     AluOp    OpdASel      OpdBSel      0hff     Literal
Nxt   0b0000000  0b11     AluOp    OpdASel      OpdBSel      FlgSel   Literal
Map   MapAddr    0bXX     0bXXXXX  0bXXXXXXXXX  0bXXXXXXXXX  0hXX     0hXXXX

All instructions include the Alu operation (AluOp),
operand "A" select (OpdASel), operand "B" select (OpdBSel)
and Literal fields. Other field usage depends upon the
instruction type.
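
As an illustration of the word format in Table 8, the following sketch packs and unpacks the 56-bit micro-instruction word and classifies it by type. It is a minimal model written for this explanation, not the hardware decoder: the field names `pgm_ctrl` (bits 48:47) and `sel` (bits 23:16) and the helper functions are assumptions introduced here, and the handling of the Map type follows only the MapEn and MapAddr condition stated in the text.

```python
from typing import Dict

# (name, high bit, low bit) of each field in the 56-bit word of Table 8.
# "pgm_ctrl" and "sel" are names assumed for this sketch.
FIELDS = [
    ("map_addr", 55, 49),
    ("pgm_ctrl", 48, 47),
    ("alu_op", 46, 42),
    ("opd_a_sel", 41, 33),
    ("opd_b_sel", 32, 24),
    ("sel", 23, 16),  # TstSel for Jcc, FlgSel for the other types
    ("literal", 15, 0),
]


def encode(**values: int) -> int:
    """Pack named field values into one word; unnamed fields are zero."""
    word = 0
    for name, hi, lo in FIELDS:
        value = values.get(name, 0)
        if value >> (hi - lo + 1):
            raise ValueError(f"{name} does not fit in {hi - lo + 1} bits")
        word |= value << lo
    return word


def decode(word: int) -> Dict[str, int]:
    """Split a micro-instruction word into its named fields."""
    return {
        name: (word >> lo) & ((1 << (hi - lo + 1)) - 1)
        for name, hi, lo in FIELDS
    }


def classify(fields: Dict[str, int], map_en: bool = False) -> str:
    """Return the instruction type implied by the decoded fields."""
    if map_en and fields["map_addr"] != 0:
        return "Map"
    ctrl = fields["pgm_ctrl"]
    if ctrl == 0b11:
        # Rts is the special form of Nxt whose FlgSel field is 0hff.
        return "Rts" if fields["sel"] == 0xFF else "Nxt"
    return ("Jcc", "Jmp", "Jsr")[ctrl]
```

For example, `classify(decode(encode(pgm_ctrl=0b10, literal=0x0100)))` yields `"Jsr"`, while the same call with `pgm_ctrl=0b11` and `sel=0xFF` yields `"Rts"`.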

The "jump condition code" (Jcc) instruction causes the
program counter to be altered if the condition selected by the
"test select" (TstSel) field is asserted. The new program
counter (Pc) value is loaded from either the Literal field or
the AluOut as described in the following section and the
Literal field may be used as a source for the Alu or the ram
address if the new Pc value is sourced by the Alu.

The "jump" (Jmp) instruction causes the program counter
to be altered unconditionally. The new program counter (Pc)
value is loaded from either the Literal field or the AluOut as
described in the following section. The format allows
instruction bits 23:16 to be used to perform a flag operation
and the Literal field may be used as a source for the Alu or
the ram address if the new Pc value is sourced by the Alu.

The "jump subroutine" (Jsr) instruction causes the
program counter to be altered unconditionally. The new
program counter (Pc) value is loaded from either the Literal
field or the AluOut as described in the following section. The
old program counter value is stored on the top location of the
Pc-Stack which is implemented as a LIFO memory. The
format allows instruction bits 23:16 to be used to perform a
flag operation and the Literal field may be used as a source
for the Alu or the ram address if the new Pc value is sourced
by the Alu.

The "Nxt" (Nxt) instruction causes the program counter to
increment. The format allows instruction bits 23:16 to be
used to perform a flag operation and the Literal field may be
used as a source for the Alu or the ram address.

The "return from subroutine" (Rts) instruction is a special
form of the Nxt instruction in which the "flag operation"
(FlgSel) field is set to a value of 0hff. The current Pc value
is replaced with the last value stored in the stack. The Literal
field may be used as a source for the Alu or the ram address.

The Map instruction is provided to allow replacement of
instructions which have been stored in ROM and is
implemented any time the "map enable" (MapEn) bit has been set
and the content of the "map address" (MapAddr) field is
non-zero. The instruction decoder forces a jump instruction
with the Alu operation and destination fields set to pass the
MapAddr field to the program control block.

The program control is determined by a combination of 
PgmCtrl, DstOpd, FlgSel and TstSel. The behavior of the 
program control is illustrated in the "C-like" description 
contained in CD Appendix A. 
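
CD Appendix A is not reproduced here, so the following is only a rough sketch of the program counter behavior as the instruction descriptions state it. Several points are assumptions made for illustration: the `Sequencer` class and its method names are hypothetical, a Jcc whose test is not asserted is assumed to fall through to the next address, the jump target is passed in rather than selected between the Literal field and AluOut, and the old Pc is pushed verbatim because the text does not say whether the stored value has already been incremented. The stack depth of three and the error location 0004 come from the text.

```python
ERROR_VECTOR = 0x0004  # location forced on a program error, per the text
STACK_DEPTH = 3        # the LIFO stack is described as three registers


class Sequencer:
    """Toy model of program counter updates for the five control types."""

    def __init__(self, pc: int = 0):
        self.pc = pc
        self.stack = []

    def step(self, op: str, target: int = 0, test: bool = False) -> int:
        """Apply one instruction's program control and return the new Pc."""
        if op == "Jcc":
            # Fall-through on a failed test is an assumption of this sketch.
            self.pc = target if test else self.pc + 1
        elif op == "Jmp":
            self.pc = target
        elif op == "Jsr":
            if len(self.stack) >= STACK_DEPTH:
                self.pc = ERROR_VECTOR  # stack overflow
            else:
                self.stack.append(self.pc)
                self.pc = target
        elif op == "Rts":
            if not self.stack:
                self.pc = ERROR_VECTOR  # stack underflow
            else:
                self.pc = self.stack.pop()
        elif op == "Nxt":
            self.pc += 1
        else:
            raise ValueError(f"unknown instruction type: {op}")
        return self.pc
```

A fourth nested Jsr, or an Rts with nothing on the stack, sends the model to location 0004, which mirrors the stack overflow and underflow errors that the hardware is said to detect.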

Hardware will detect certain program errors. Any 
sequencer generating a program error will be forced to 
continue executing from location 0004. The program errors 
detected are: 

1. Stack Overflow — A JSR is attempted and the stack 
registers are full. 

2. Stack Underflow — An RTS is attempted and the stack 
registers are empty. 

3. Incompatible Sram Size & Sram Alignment — An Sram
Operation is attempted where the size and the Sram address
would cause the operation to extend beyond the size of the
word, e.g. Size=4 Address=401 or Size=2 Address=563.

4. A Sram read is attempted immediately following an
Sram write. Because an Sram write is actually done in the
clock cycle of the following instruction, the sram interface
will be busy during that phase, and an Sram read is illegal
at this time.

Sequencer behavior is described in CD Appendix A.

FIG. 24 is a diagram of various sequencers contained in
the INIC with arrows representing the flow of data
therebetween. Request information such as r/w, address, size,
endian and alignment are represented by each request line.
Acknowledge information to master sequencers include
only the size of the transfer being acknowledged.

FIG. 25 illustrates how data movement is accomplished
for a Pci slave write to Dram. Note that the Psi (Pci slave in)
module functions as both a slave sequencer and a master
sequencer. Psi sends a write request to the SramCtrl module.
Psi requests Xwr to move data from Sram to dram. Xwr
subsequently sends a read request to the SramCtrl module
then writes the data to the dram via the Xctrl module. As
each piece of data is moved from the Sram to Xwr, Xwr
sends an acknowledge to the Psi module.

FIG. 26 is a diagram of an SRAM CONTROL
SEQUENCER (SramCtrl). Sram is the nexus for data
movement within the microprocessor. A hierarchy of sequencers,
working in concert, accomplish the movement of data
between dram, Sram, Cpu, ethernet and the Pci bus. Slave
sequencers, provided with stimulus from master sequencers,
request data movement operations by way of the Sram, Pci
bus, Dram and Flash. The slave sequencers prioritize,
service and acknowledge the requests.

The Sram control sequencer services requests to store to,
or retrieve data from an Sram organized as 1024 locations by
128 bits (16KB). The sequencer operates at a frequency of
133 MHz, allowing both a Cpu access and a dma access to
occur during a standard 66 MHz Cpu cycle. One 133 MHz
cycle is reserved for Cpu accesses during each 66 MHz cycle
while the remaining 133 MHz cycle is reserved for dma
accesses on a prioritized basis.

FIG. 26 shows the major functions of the Sram control
sequencer. A slave sequencer begins by asserting a request
along with r/w, ram address, endian, data path size, data path
alignment and request size. SramCtrl prioritizes the requests.
The request parameters are then selected by a multiplexer
which feeds the parameters to the Sram via a register. The
requester provides the Sram address which when coupled
with the other parameters controls the input and output
alignment. Sram outputs are fed to the output aligner via a
register. Requests are acknowledged in parallel with the
returned data. FIG. 27 is a timing diagram depicting two ram
accesses during a single 66 MHz clock cycle.

FIG. 28 is a diagram of an EXTERNAL MEMORY
CONTROL (Xctrl). Xctrl provides the facility whereby
Xwr, Xrd, Dcfg and Eectrl access external Flash and Dram.
Xctrl includes an arbiter, i/o registers, data multiplexers,
address multiplexers and control multiplexers. Ownership of
the external memory interface is requested by each block
and granted to each of the requesters by the arbiter function.
Once ownership has been granted the multiplexers select the
address, data and control signals from owner, allowing
access to external memory.

FIG. 30 is a diagram of an EXTERNAL MEMORY
READ SEQUENCER (Xrd). The Xrd sequencer acts only as
a slave sequencer. Servicing requests issued by master
sequencers, the Xrd sequencer moves data from external
sdram or flash to the Sram, via the Xctrl module, in blocks
of 32 bytes or less. The nature of the sdram requires fixed
burst sizes for each of its internal banks with ras precharge
intervals between each access. By selecting a burst size of 32
bytes for sdram reads and interleaving bank accesses on a 16
byte boundary, we can ensure that the ras precharge interval
for the first bank is satisfied before burst completion for the
second bank, allowing us to re-instruct the first bank and
continue with uninterrupted dram access. Sdrams require a
consistent burst size be utilized each and every time the
sdram is accessed. For this reason, if an sdram access does
not begin or end on a 32 byte boundary, sdram bandwidth
will be reduced due to less than 32 bytes of data being
transferred during the burst cycle.

A first step in servicing a request to move data from sdram
to Sram is the prioritization of the master sequencer
requests. Next the Xrd sequencer takes a snapshot of the
dram read address and applies configuration information to
determine the correct bank, row and column address to
apply. Once sufficient data has been read, the Xrd sequencer
issues a write request to the SramCtrl sequencer which in
turn sends an acknowledge to the Xrd sequencer. The Xrd
sequencer passes the acknowledge along to the level two
master with a size code indicating how much data was
written during the Sram cycle allowing the update of
pointers and counters. The dram read and Sram write cycles
repeat until the original burst request has been completed at
which point the Xrd sequencer prioritizes any remaining
requests in preparation for the next burst cycle.

Contiguous dram burst cycles are not guaranteed to the
Xrd sequencer as an algorithm is implemented which
ensures highest priority to refresh cycles followed by flash
accesses, dram writes then dram reads.

FIG. 29 is a timing diagram illustrating how data is read
from sdram. The dram has been configured for a burst of
four with a latency of two clock cycles. Bank A is first
selected/activated followed by a read command two clock
cycles later. The bank select/activate for bank B is next
issued as read data begins returning two clocks after the read
command was issued to bank A. Two clock cycles before we
need to receive data from bank B we issue the read
command. Once all 16 bytes have been received from bank A we
begin receiving data from bank B.

FIG. 32 depicts the major functional blocks of the
EXTERNAL MEMORY WRITE SEQUENCER (Xwr). The
Xwr sequencer is a slave sequencer. Servicing requests
issued by master sequencers, the Xwr sequencer moves data
from Sram to the external sdram or flash, via the Xctrl
module, in blocks of 32 bytes or less while accumulating a
checksum of the data moved. The nature of the sdram
requires fixed burst sizes for each of its internal banks with
ras precharge intervals between each access. By selecting a
burst size of 32 bytes for sdram writes and interleaving bank
accesses on a 16 byte boundary, we can ensure that the ras
precharge interval for the first bank is satisfied before burst
completion for the second bank, allowing us to re-instruct
the first bank and continue with uninterrupted dram access.
Sdrams require a consistent burst size be utilized each and
every time the sdram is accessed. For this reason, if an sdram
access does not begin or end on a 32-byte boundary, sdram
bandwidth will be reduced due to less than 32 bytes of data
being transferred during the burst cycle.

The first step in servicing a request to move data from
Sram to sdram is the prioritization of the level two master
requests. Next the Xwr sequencer takes a snapshot of the
dram write address and applies configuration information to
determine the correct dram, bank, row and column address
to apply. The Xwr sequencer immediately issues a read
command to the Sram to which the Sram responds with both
data and an acknowledge. The Xwr sequencer passes the
acknowledge to the level two master along with a size code
indicating how much data was read during the Sram cycle
allowing the update of pointers and counters. Once sufficient
data has been read from Sram, the Xwr sequencer issues a
write command to the dram starting the burst cycle and
computing a checksum as the data flies by. The Sram read
cycle repeats until the original burst request has been
completed at which point the Xwr sequencer prioritizes any
remaining requests in preparation for the next burst cycle.

Contiguous dram burst cycles are not guaranteed to the
Xwr sequencer as an algorithm is implemented which
ensures highest priority to refresh cycles followed by flash
accesses then dram writes.

FIG. 31 is a timing diagram illustrating how data is
written to sdram. The dram has been configured for a burst
of four with a latency of two clock cycles. Bank A is first
selected/activated followed by a write command two clock
cycles later. The bank select/activate for bank B is next
issued in preparation for issuing the second write command.
As soon as the first 16 byte burst to bank A completes we
issue the write command for bank B and begin supplying
data.

A PCI MASTER-OUT SEQUENCER (Pmo) is shown in
FIG. 33. The Pmo sequencer acts only as a slave sequencer.
Servicing requests issued by master sequencers, the Pmo
sequencer moves data from an Sram based fifo to a Pci
target, via the PciMstrIO module, in bursts of up to 256
bytes. The nature of the PCI bus dictates the use of the write
line command to ensure optimal system performance. The
write line command requires that the Pmo sequencer be
capable of transferring a whole multiple (1X, 2X, 3X, . . . )
of cache lines of which the size is set through the Pci
configuration registers. To accomplish this end, Pmo will
automatically perform partial bursts until it has aligned the
transfers on a cache line boundary at which time it will begin
usage of the write line command. The Sram fifo depth, of
256 bytes, has been chosen in order to allow Pmo to
accommodate cache line sizes up to 128 bytes. Provided the
cache line size is less than 128 bytes, Pmo will perform
multiple, contiguous cache line bursts until it has exhausted
the supply of data.

Pmo receives requests from two separate sources; the
dram to Pci (D2p) module and the Sram to Pci (S2p) module.
An operation first begins with prioritization of the requests
where the S2p module is given highest priority. Next, the
Pmo module takes a snapshot of the Sram fifo address and
uses this to generate read requests for the SramCtrl
sequencer. The Pmo module then proceeds to arbitrate for
ownership of the Pci bus via the PciMstrIO module. Once
the Pmo holding registers have sufficient data and Pci bus
mastership has been granted, the Pmo module begins
transferring data to the Pci target. For each successful transfer,
Pmo sends an acknowledge and encoded size to the master
sequencer, allowing it to update its internal pointers, counters
and status. Once the Pci burst transaction has terminated,
Pmo parks on the Pci bus unless another initiator has
requested ownership. Pmo again prioritizes the incoming
requests and repeats the process.

FIG. 34 is a diagram of a PCI MASTER-IN
SEQUENCER (Pmi). The Pmi sequencer acts only as a
slave sequencer. Servicing requests issued by master
sequencers, the Pmi sequencer moves data from a Pci target
to an Sram based fifo, via the PciMstrIO module, in bursts
of up to 256 bytes. The nature of the PCI bus dictates the use
of the read multiple command to ensure optimal system
performance. The read multiple command requires that the
Pmi sequencer be capable of transferring a cache line or
more of data. To accomplish this end, Pmi will automatically
perform partial cache line bursts until it has aligned the
transfers on a cache line boundary at which time it will begin
usage of the read multiple command. The Sram fifo depth,
of 256 bytes, has been chosen in order to allow Pmi to
accommodate cache line sizes up to 128 bytes. Provided the
cache line size is less than 128 bytes, Pmi will perform
multiple, contiguous cache line bursts until it has filled the
fifo.

Pmi receives requests from two separate sources; the Pci to
dram (P2d) module and the Pci to Sram (P2s) module. An
operation first begins with prioritization of the requests
where the P2s module is given highest priority. The Pmi
module then proceeds to arbitrate for ownership of the Pci
bus via the PciMstrIO module. Once the Pci bus mastership
has been granted and the Pmi holding registers have
sufficient data, the Pmi module begins transferring data to the
Sram fifo. For each successful transfer, Pmi sends an
acknowledge and encoded size to the master sequencer
allowing it to update its internal pointers, counters and
status. Once the Pci burst transaction has terminated Pmi
parks on the Pci bus unless another initiator has requested
ownership. Pmi again prioritizes the incoming requests and
repeats the process.

FIG. 36 is a diagram of a DRAM TO PCI SEQUENCER
(D2p). The D2p sequencer acts as a master sequencer.
Servicing channel requests issued by the Cpu, the D2p
sequencer manages movement of data from dram to the Pci
bus by issuing requests to both the Xrd sequencer and the
Pmo sequencer. Data transfer is accomplished using an Sram
based fifo through which data is staged.

D2p can receive requests from any of the processor's
thirty-two dma channels. Once a command request has been
detected, D2p fetches a dma descriptor from an Sram
location dedicated to the requesting channel which includes
the dram address, Pci address, Pci endian and request size.
D2p then issues a request to the D2s sequencer causing the
Sram based fifo to fill with dram data. Once the fifo contains
sufficient data for a Pci transaction, D2s issues a request to
Pmo which in turn moves data from the fifo to a Pci target.
The process repeats until the entire request has been satisfied
at which time D2p writes ending status in to the Sram dma
descriptor area and sets the channel done bit associated with
that channel. D2p then monitors the dma channels for
additional requests.

FIG. 35 is an illustration showing the major blocks
involved in the movement of data from dram to Pci target.

FIG. 38 is a diagram of a PCI TO DRAM SEQUENCER
(P2d). The P2d sequencer acts as both a slave sequencer and
a master sequencer. Servicing channel requests issued by the
Cpu, the P2d sequencer manages movement of data from Pci
bus to dram by issuing requests to both the Xwr sequencer
and the Pmi sequencer. Data transfer is accomplished using
an Sram based fifo through which data is staged.

P2d can receive requests from any of the processor's
thirty-two dma channels. Once a command request has been
detected, P2d, operating as a slave sequencer, fetches a dma
descriptor from an Sram location dedicated to the requesting
channel which includes the dram address, Pci address, Pci
endian and request size. P2d then issues a request to Pmi
which in turn moves data from the Pci target to the Sram fifo.
Next, P2d issues a request to the Xwr sequencer causing the
Sram based fifo contents to be written to the dram. The
process repeats until the entire request has been satisfied at
which time P2d writes ending status in to the Sram dma
descriptor area and sets the channel done bit associated with
that channel. P2d then monitors the dma channels for
additional requests.
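
The descriptor-driven loop described for P2d can be sketched in software as follows. This is a simplified model under stated assumptions, not the hardware implementation: `DmaDescriptor`, `service_p2d`, and the two callables standing in for the Pmi and Xwr sequencers are hypothetical names, and moving one full fifo (256 bytes, the depth given for the Pmi fifo) per iteration is an assumption about chunking.

```python
from dataclasses import dataclass
from typing import Callable, Dict

FIFO_BYTES = 256  # Sram fifo depth stated for the Pmi sequencer


@dataclass
class DmaDescriptor:
    """Per-channel descriptor fields named in the text, plus a status slot."""
    dram_addr: int
    pci_addr: int
    pci_endian: int
    size: int
    status: str = "pending"


def service_p2d(
    channel: int,
    descriptors: Dict[int, DmaDescriptor],
    pci_read: Callable[[int, int], bytes],      # stands in for Pmi
    dram_write: Callable[[int, bytes], None],   # stands in for Xwr
    done_bits: int = 0,
) -> int:
    """Move one channel's request from Pci to dram; return the done mask."""
    desc = descriptors[channel]  # fetched from the channel's Sram location
    moved = 0
    while moved < desc.size:
        chunk = min(FIFO_BYTES, desc.size - moved)
        fifo = pci_read(desc.pci_addr + moved, chunk)  # Pci target -> fifo
        dram_write(desc.dram_addr + moved, fifo)       # fifo -> dram
        moved += chunk
    desc.status = "done"               # ending status written back
    return done_bits | (1 << channel)  # channel done bit set
```

The same shape applies to the other channel sequencers described in this section, with the source, destination and helper sequencers swapped accordingly.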

FIG. 37 is an illustration showing the major blocks
involved in the movement of data from a Pci target to dram.

FIG. 40 is a diagram of a SRAM TO PCI SEQUENCER
(S2p). The S2p sequencer acts as both a slave sequencer and
a master sequencer. Servicing channel requests issued by the
Cpu, the S2p sequencer manages movement of data from
Sram to the Pci bus by issuing requests to the Pmo sequencer.

S2p can receive requests from any of the processor's
thirty-two dma channels. Once a command request has been
detected, S2p, operating as a slave sequencer, fetches a dma
descriptor from an Sram location dedicated to the requesting
channel which includes the Sram address, Pci address, Pci
endian and request size. S2p then issues a request to Pmo
which in turn moves data from the Sram to a Pci target. The
process repeats until the entire request has been satisfied at
which time S2p writes ending status in to the Sram dma
descriptor area and sets the channel done bit associated with
that channel. S2p then monitors the dma channels for
additional requests.

FIG. 39 is an illustration showing the major blocks 
involved in the movement of data from Sram to Pci target. 

FIG. 42 is a diagram of a PCI TO SRAM SEQUENCER
(P2s). The P2s sequencer acts as both a slave sequencer and
a master sequencer. Servicing channel requests issued by the
Cpu, the P2s sequencer manages movement of data from Pci
bus to Sram by issuing requests to the Pmi sequencer.

P2s can receive requests from any of the processor's
thirty-two dma channels. Once a command request has been
detected, P2s, operating as a slave sequencer, fetches a dma
descriptor from an Sram location dedicated to the requesting
channel which includes the Sram address, Pci address, Pci
endian and request size. P2s then issues a request to Pmi
which in turn moves data from the Pci target to the Sram.
The process repeats until the entire request has been satisfied
at which time P2s writes ending status in to the dma
descriptor area of Sram and sets the channel done bit
associated with that channel. P2s then monitors the dma
channels for additional requests.

FIG. 41 is an illustration showing the major blocks
involved in the movement of data from a Pci target to dram. 

FIG. 44 is a diagram of a DRAM TO SRAM 
SEQUENCER (D2s). The D2s sequencer acts as both a 
slave sequencer and a master sequencer. Servicing channel 
requests issued by the Cpu, the D2s sequencer manages
movement of data from dram to Sram by issuing requests to 
the Xrd sequencer. 
D2s can receive requests from any of the processor's
thirty-two dma channels. Once a command request has been 
detected, D2s, operating as a slave sequencer, fetches a dma 
descriptor from an Sram location dedicated to the requesting 
channel which includes the dram address, Sram address and 
request size. D2s then issues a request to the Xrd sequencer 
causing the transfer of data to the Sram. The process repeats 
until the entire request has been satisfied at which time D2s 
writes ending status in to the Sram dma descriptor area and 
sets the channel done bit associated with that channel. D2s 
then monitors the dma channels for additional requests. 

FIG. 43 is an illustration showing the major blocks 
involved in the movement of data from dram to Sram. 

FIG. 46 is a diagram of a SRAM TO DRAM 
SEQUENCER (S2d). The S2d sequencer acts as both a slave 
sequencer and a master sequencer. Servicing channel 
requests issued by the Cpu, the S2d sequencer manages 
movement of data from Sram to dram by issuing requests to 
the Xwr sequencer. 

S2d can receive requests from any of the processor's 
thirty-two dma channels. Once a command request has been 
detected, S2d, operating as a slave sequencer, fetches a dma 
descriptor from an Sram location dedicated to the requesting 
channel which includes the dram address, Sram address, 
checksum reset and request size. S2d then issues a request 
to the Xwr sequencer causing the transfer of data to the 
dram. The process repeats until the entire request has been 
satisfied at which time S2d writes ending status in to the 
Sram dma descriptor area and sets the channel done bit 
associated with that channel. S2d then monitors the dma 
channels for additional requests. 

FIG. 45 is an illustration showing the major blocks
involved in the movement of data from Sram to dram.

FIG. 47 depicts a sequence of events when a PCI SLAVE INPUT
SEQUENCER (Psi) is the target of a Pci write operation.
The Psi sequencer acts as both a slave sequencer and a 
master sequencer. Servicing requests issued by a Pci master, 
the Psi sequencer manages movement of data from Pci bus 
to Sram and Pci bus to dram via Sram by issuing requests to 
the SramCtrl and Xwr sequencers. 

Psi manages write requests to configuration space,
expansion rom, dram, Sram and memory mapped registers. Psi
separates these Pci bus operations into two categories with
different action taken for each. Dram accesses result in Psi
generating a write request to an Sram buffer followed by a
write request to the Xwr sequencer. Subsequent write or read
dram operations are retry terminated until the buffer has
been emptied. An event notification is set for the processor
allowing message passing to occur through dram space.

All other Pci write transactions result in Psi posting the 
write information including Pci address, Pci byte marks and 
Pci data to a reserved location in Sram, then setting an event 
flag which the event processor monitors. Subsequent writes 
or reads of configuration, expansion rom, Sram or registers 
are terminated with retry until the processor clears the event 
flag. This allows SiMBa to keep pipelining levels to a 
minimum for the posted write and give the processor ample 
time to modify data for subsequent Pci read operations. Note 
that events 4 through 7 occur only when the write operation 
targets the dram. 
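
The two categories of Pci write handling described above can be sketched as a small software model. This is only an illustrative sketch, not the hardware implementation: the class name PsiModel, its fields and the string return values are hypothetical, and the model assumes that the dram buffer and the posted-write event flag gate their own categories independently.

```python
from dataclasses import dataclass, field

RETRY = "retry"        # transaction is terminated with retry
ACCEPTED = "accepted"  # transaction is taken by Psi


@dataclass
class PsiModel:
    """Toy model of the Psi write path; all names are illustrative."""
    dram_buffer_busy: bool = False  # Sram buffer not yet emptied to dram
    event_flag: bool = False        # set while a posted write awaits the processor
    posted: dict = field(default_factory=dict)

    def pci_write(self, target, addr, byte_marks, data):
        if target == "dram":
            # Dram category: buffer in Sram, then hand off to Xwr.
            if self.dram_buffer_busy:
                return RETRY
            self.dram_buffer_busy = True
            return ACCEPTED
        # All other targets: configuration space, expansion rom, Sram, registers.
        if self.event_flag:
            return RETRY
        self.posted = {"addr": addr, "byte_marks": byte_marks, "data": data}
        self.event_flag = True
        return ACCEPTED

    def buffer_emptied(self):
        """Xwr has moved the buffered data to dram."""
        self.dram_buffer_busy = False

    def processor_clears_event(self):
        """The processor has consumed the posted write."""
        self.event_flag = False
```

In this sketch a second register write is retried until processor_clears_event() is called, which mirrors the single-level pipelining of posted writes described above.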

FIG. 48 depicts the sequence of events when a PCI 
SLAVE OUTPUT SEQUENCER (Pso) is the target of a Pci 
read operation. The Pso sequencer acts as both a slave 
sequencer and a master sequencer. Servicing requests issued 
by a Pci master, the Pso sequencer manages movement of 
data to Pci bus from Sram and to Pci bus from dram via Sram 
by issuing requests to the SramCtrl and Xrd sequencers. 

Pso manages read requests to configuration space, expansion 
rom, dram, Sram and memory mapped registers. Pso 
separates these Pci bus operations into two categories, with 
a different action taken for each. Dram accesses result in Pso 
generating a read request to the Xrd sequencer followed by 
a read request to the Sram buffer. Subsequent write or read dram 
operations are retry terminated until the buffer has been 
emptied. 

All other Pci read transactions result in Pso posting the 
read request information including Pci address and Pci byte 
marks to a reserved location in Sram, then setting an event 
flag which the event processor monitors. Subsequent writes 
or reads of configuration, expansion rom, Sram or registers 
are terminated with retry until the processor clears the event 
flag. This allows SiMBa to use a microcoded response 
mechanism to return data for the request. The processor 
decodes the request information, formulates or fetches the 
requested data and stores it in Sram then clears the event flag 
allowing Pso to fetch the data and return it on the Pci bus. 

FIG. 50 is a diagram of a FRAME RECEIVE 
SEQUENCER (RcvX). The receive sequencer (RcvSeq) 
analyzes and manages incoming packets, stores the result in 
dram buffers, then notifies the processor through the receive 
queue (RcvQ) mechanism. The process begins when a buffer 
descriptor is available at the output of the FreeQ. RcvSeq 
issues a request to the Qmg which responds by supplying the 
buffer descriptor to RcvSeq. RcvSeq then waits for a receive 
packet. The Mac, network, transport and session information 
is analyzed as each byte is received and stored in the 
assembly register (AssyReg). When four bytes of information 
are available, RcvSeq requests a write of the data to the 
Sram. When sufficient data has been stored in the Sram 
based receive fifo, a dram write request is issued to Xwr. The 
process continues until the entire packet has been received 
at which point RcvSeq stores the results of the packet 
analysis in the beginning of the dram buffer. Once the buffer 
and status have both been stored, RcvSeq issues a write-queue 
request to Qmg. Qmg responds by storing a buffer 
descriptor and a status vector provided by RcvSeq. The 
process then repeats. If RcvSeq detects the arrival of a 
packet before a free buffer is available, it ignores the packet 
and sets the FrameLost status bit for the next received 
packet. 
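
The free-buffer handling and the FrameLost behavior described above can be sketched as follows. This is a simplified model under stated assumptions: the class RcvSeqModel, the integer buffer descriptors and the contents of the status dictionary are hypothetical stand-ins, and the byte-by-byte analysis and Sram fifo staging are omitted.

```python
from collections import deque


class RcvSeqModel:
    """Toy model of RcvSeq buffer handling; names are illustrative."""

    def __init__(self, free_buffers):
        self.free_q = deque(free_buffers)  # stands in for the FreeQ
        self.rcv_q = deque()               # stands in for the RcvQ
        self.frame_lost = False            # pending FrameLost indication

    def packet_arrives(self, packet: bytes) -> bool:
        if not self.free_q:
            # No free buffer: ignore the packet and remember the loss.
            self.frame_lost = True
            return False
        buf = self.free_q.popleft()
        # The status vector carries FrameLost for the next received packet.
        status = {"length": len(packet), "FrameLost": self.frame_lost}
        self.frame_lost = False
        self.rcv_q.append((buf, status))
        return True
```

A packet that arrives while the FreeQ is empty is dropped, and the loss is reported in the status of the next packet that is successfully received.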

FIG. 49 depicts a sequence of events for successful 
reception of a packet followed by a definition of the receive 
buffer and the buffer descriptor as stored on the RcvQ. 

CD Appendix B defines various bits of control informa- 
tion relating to receive packets. 

FIG. 52 is a diagram of a FRAME TRANSMIT 
SEQUENCER (XmtX). The transmit sequencer (XmtSeq) 
analyzes and manages outgoing packets, using buffer 
descriptors retrieved from the transmit queue (XmtQ) then 
storing the descriptor for the freed buffer in the free buffer 
queue (FreeQ). The process begins when a buffer descriptor 
is available at the output of the XmtQ. XmtSeq issues a 
request to the Qmg which responds by supplying the buffer 
descriptor to XmtSeq. XmtSeq then issues a read request to 
the Xrd sequencer. Next, XmtSeq issues a read request to 
SramCtrl then instructs the Mac to begin frame transmission. 
Once the frame transmission has completed, XmtSeq 
stores the buffer descriptor on the FreeQ thereby recycling 
the buffer. 
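
The buffer recycling performed by XmtSeq can be illustrated with a short sketch. The function transmit_one, the dictionary standing in for dram and the send_frame callback are hypothetical; the two read requests (to Xrd and to SramCtrl) are collapsed into a single lookup for brevity.

```python
from collections import deque


def transmit_one(xmt_q, free_q, dram, send_frame):
    """Transmit one queued frame and recycle its buffer descriptor.

    Returns False when the XmtQ holds no descriptor.
    """
    if not xmt_q:
        return False
    desc = xmt_q.popleft()   # descriptor supplied by Qmg from the XmtQ
    frame = dram[desc]       # stands in for the Xrd and SramCtrl reads
    send_frame(frame)        # the Mac transmits the frame
    free_q.append(desc)      # descriptor returns to the FreeQ
    return True
```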

FIG. 51 depicts a sequence of events for successful 
transmission of a packet followed by a definition of the 
receive buffer and the buffer descriptor as stored on the 
XmtQ. 

CD Appendix C defines various bits of control information 
relating to transmit packets. 

FIG. 54 is a diagram of a QUEUE MANAGER (Qmg). 
The INIC includes special hardware assist for the implementation 
of message and pointer queues. The hardware 
assist is called the queue manager (Qmg) and manages the 
movement of queue entries between Cpu and Sram, between 
dma sequencers and Sram as well as between Sram and 
dram. Queues comprise three distinct entities: the queue 
head (QHd), the queue tail (QTl) and the queue body 
(QBdy). QHd resides in 64 bytes of scratch ram and provides 
the area to which entries will be written (pushed). QTl 
resides in 64 bytes of scratch ram and contains queue 
locations from which entries will be read (popped). QBdy 
resides in dram and contains locations for expansion of the 
queue in order to minimize the Sram space requirements. 
The QBdy size depends upon the queue being accessed and 
the initialization parameters presented during queue 
initialization. 
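
The division of a queue into a head, a tail and an expandable body can be sketched in software as follows. This is a minimal model under assumptions not stated in the text: the class SplitQueue is hypothetical, the default of sixteen slots assumes four-byte entries in each 64-byte area, and the policy of spilling a full head into the body and refilling an empty tail from the body is one plausible arrangement, not necessarily the one used by Qmg.

```python
from collections import deque


class SplitQueue:
    """Toy FIFO with small head/tail areas and a large body."""

    def __init__(self, head_slots=16, tail_slots=16):
        self.head = deque()   # models QHd: entries are pushed here
        self.body = deque()   # models QBdy: expansion space in dram
        self.tail = deque()   # models QTl: entries are popped from here
        self.head_slots = head_slots
        self.tail_slots = tail_slots

    def push(self, entry):
        if len(self.head) == self.head_slots:
            # Head is full: spill its entries into the body.
            self.body.extend(self.head)
            self.head.clear()
        self.head.append(entry)

    def pop(self):
        if not self.tail:
            self._refill_tail()
        return self.tail.popleft() if self.tail else None

    def _refill_tail(self):
        # Older entries live in the body; use the head only if the body is empty.
        source = self.body if self.body else self.head
        while source and len(self.tail) < self.tail_slots:
            self.tail.append(source.popleft())
```

Because the tail is refilled only when empty, and always from the oldest remaining entries, first-in first-out order is preserved across the three areas.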

FIG. 53 is a timing diagram for the Qmg, which accepts 
operations from both Cpu and dma sources. Executing these 
operations at a frequency of 133 MHz, Qmg reserves even 
cycles for dma requests and reserves odd cycles for Cpu 
requests. Valid Cpu operations include initialize queue 
(InitQ), write queue (WrQ) and read queue (RdQ). Valid 
dma requests include read body (RdBdy) and write body 
(WrBdy). Qmg, working in unison with Q2d and D2q, 
generates requests to the Xwr and Xrd sequencers to control 
the movement of data between the QHd, QTl and QBdy. 
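
The alternation of even cycles for dma requests and odd cycles for Cpu requests can be sketched as a simple slot arbiter. The function arbitrate and the list-based request queues are hypothetical, and the sketch assumes that a slot with no pending request of its own type is left idle rather than given to the other source.

```python
def arbitrate(cycle, dma_pending, cpu_pending):
    """Return the operation served in this cycle, or None if the slot is idle.

    Even cycles are reserved for dma requests, odd cycles for Cpu requests.
    """
    pending = dma_pending if cycle % 2 == 0 else cpu_pending
    return pending.pop(0) if pending else None


dma_ops = ["RdBdy", "WrBdy"]
cpu_ops = ["WrQ"]
served = [arbitrate(cycle, dma_ops, cpu_ops) for cycle in range(4)]
```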

The arbiter selects the next operation to be performed. 
The dual-ported Sram holds the queue variables HdWrAddr, 
HdRdAddr, TlWrAddr, TlRdAddr, BdyWrAddr, 
BdyRdAddr and QSz. Qmg accepts an operation request, 
fetches the queue variables from the queue ram (Qram), 
modifies the variables based on the current state and the 
requested operation then updates the variables and issues a 
read or write request to the Sram controller. The Sram 
controller services the requests by writing the tail or reading 
the head and returning an acknowledge. 

DMA operations are accomplished through a combination 
of thirty-two dma channels (DmaCh) and seven dma 
sequencers (DmaSeq). Each dma channel provides a mechanism 
whereby a Cpu can issue a command to any of the 
seven dma sequencers. Whereas the dma channels are 
multi-purpose, the dma sequencers they command are 
single-purpose, as follows. 
Table 9 lists functions of the dma sequencers. 

TABLE 9 

DMA SEQ #   NAME     DESCRIPTION 
0           none     This is a no operation address. 
1           D2dSeq   Moves data from ExtMem to ExtMem. 
2           D2sSeq   Moves data from ExtMem bus to sram. 
3           D2pSeq   Moves data from ExtMem to Pci bus. 
4           S2dSeq   Moves data from sram to ExtMem. 
5           S2pSeq   Moves data from sram to Pci bus. 
6           P2dSeq   Moves data from Pci bus to ExtMem. 
7           P2sSeq   Moves data from Pci bus to sram. 

The processors manage dma in the following way. The 
processor writes a dma descriptor to an Sram location 
reserved for the dma channel. The format of the dma 
descriptor is dependent upon the targeted dma sequencer. 
The processor then writes the dma sequencer number to the 
channel command register. 

Each of the dma sequencers polls all thirty-two dma 
channels in search of commands to execute. Once a com- 
mand request has been detected, the dma sequencer fetches 
a dma descriptor from a fixed location in Sram. The Sram 
location is fixed and is determined by the dma channel 
number. The dma sequencer loads the dma descriptor into 
its own registers, executes the command, then overwrites 
the dma descriptor with ending status. Once the command 
has halted, due to completion or error, and the ending status 
has been written, the dma sequencer sets the done bit for the 
current dma channel. 

The done bit appears in a dma event register which the 
Cpu can examine. The Cpu fetches ending status from Sram, 
then clears the done bit by writing zeroes to the channel 
command (ChCmd) register. The channel is now ready to 
accept another command. 
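
The command handshake between the Cpu and a dma sequencer described above can be sketched as a small model. This is an illustrative sketch only: the class DmaModel, the enum member names and the execute callback are hypothetical, descriptors are modeled as ordinary Python objects, and the sequencer numbers follow Table 9.

```python
from enum import IntEnum


class DmaSeq(IntEnum):
    """Sequencer numbers as listed in Table 9."""
    NONE = 0
    D2D = 1
    D2S = 2
    D2P = 3
    S2D = 4
    S2P = 5
    P2D = 6
    P2S = 7


NUM_CHANNELS = 32


class DmaModel:
    """Toy model of the dma channel handshake; names are illustrative."""

    def __init__(self):
        self.descriptor = [None] * NUM_CHANNELS    # fixed Sram slot per channel
        self.ch_cmd = [DmaSeq.NONE] * NUM_CHANNELS  # channel command registers
        self.done = [False] * NUM_CHANNELS          # bits of the dma event register

    def cpu_issue(self, ch, seq, descriptor):
        # The Cpu writes the descriptor first, then the sequencer number.
        self.descriptor[ch] = descriptor
        self.ch_cmd[ch] = seq

    def sequencer_poll(self, seq, execute):
        # A sequencer scans every channel for commands addressed to it.
        for ch in range(NUM_CHANNELS):
            if self.ch_cmd[ch] == seq and not self.done[ch]:
                status = execute(self.descriptor[ch])
                self.descriptor[ch] = status  # descriptor overwritten with ending status
                self.done[ch] = True

    def cpu_complete(self, ch):
        # The Cpu fetches ending status, then writes zeroes to ChCmd.
        status = self.descriptor[ch]
        self.ch_cmd[ch] = DmaSeq.NONE
        self.done[ch] = False
        return status
```

After cpu_complete() returns, the channel holds the no operation command and can accept a new descriptor.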

CD Appendix D defines various bits of control informa- 
tion relating to dma operations. 

What is claimed is: 

1. A system, comprising: 

a processor that performs slow-path network protocol 
processing, the slow-path network protocol processing 
being performed substantially in software; and 
network protocol accelerator circuitry, the network pro- 
tocol accelerator circuitry receiving a first network 
communication, the first network communication 
including a TCP header, the network protocol accel- 
erator circuitry performing fast-path network process- 
ing on the first network communication such that the 
processor performs substantially no TCP protocol pro- 
cessing on the first network communication, the net- 
work protocol accelerator circuitry receiving a second 
network communication, the second network commu- 
nication including a TCP header, the second network 
communication being associated with a connection, 
wherein the processor assumes control of the connec- 
tion from the network protocol accelerator circuitry, the 
processor then performing slow-path network protocol 
processing on the second network communication. 
2. The system of claim 1, wherein the connection is 
identified by a TCP source port in the second network 
communication, a TCP destination port in the second net- 
work communication, an IP source address in the second 
network communication, and an IP destination address in the 
second network communication. 

3. The system of claim 1, wherein the second network 
communication is initially fast-path processed by the network 
protocol accelerator circuitry, and wherein the second 
network communication is subsequently slow-path processed 
by the processor. 

4. The system of claim 1, wherein the first network 
communication includes a data payload, the network proto- 
col accelerator circuitry receiving the first network 
communication, determining a final destination, and then 
placing the data payload into the final destination without 
the processor performing a substantial amount of fast-path 
processing on the first network communication. 

5. A protocol accelerator integrated circuit that operates in 
conjunction with a processor, the processor executing a 
network protocol stack, the protocol accelerator integrated 
circuit receiving a first network communication from a 
network at substantially the same time that it outputs a 
second network communication to the network, the first 
network communication including a TCP header and an IP 
header, the second network communication including a TCP 
header and an IP header, the protocol accelerator integrated 
circuit comprising a pipeline of processors, the pipeline of 
processors including a receive processor and a transmit 
processor, the receive processor performing protocol pro- 
cessing on the first network communication such that the 
network protocol stack performs substantially no TCP pro- 
tocol processing on the first network communication and 
such that the network protocol stack performs substantially 
no IP protocol processing on the first network 
communication, the transmit processor performing protocol 
processing on the second network communication such that 
the network protocol stack performs substantially no TCP 
protocol processing on the second network communication 
and such that the network protocol stack performs substan- 
tially no IP protocol processing on the second network 
communication. 

6. The protocol accelerator integrated circuit of claim 5, 
wherein the protocol accelerator integrated circuit is dis- 
posed on a card, and wherein the processor is part of a host 
computer, the card being coupled to the host computer. 

***** 